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ABSTRACT 


The  relationship  between  the  structure  of  a  neural  network  and  its  ability  to 
perform  nonlinear  mapping  is  analyzed.  A  new  algorithm,  called  the  conjugate  gradient 
optimization  method,  for  calculating  the  weights  and  thresholds  of  a  neural  network  is 
presented.  The  performance  of  the  conjugate  gradient  algorithm  is  then  compared  to  the 
well  known  backpropagation  method  and  shown  to  be  more  computationally  efficient. 
A  neural  network  using  the  conjugate  gradient  algorithm  is  then  applied  to  three  simple 
examples  to  demonstrate  its  signal  processing  capabilities.  The  first  example  illustrates 
the  ability  of  the  neural  network  to  perform  classification.  The  second  compares  the 
performance  of  a  one-step  linear  predictor  to  a  neural  network  for  a  nonlinear  chaotic- 
time  series.  The  neural  network  predictor  is  shown  to  provide  much  greater  accuracy 
than  its  linear  counterpart.  The  final  application  presented  demonstrates  the  ability  of 
a  neural  network  to  perform  channel  equalization  for  a  nonminimum  phase  channel.  Its 
performance  is  then  compared  to  its  linear  equivalent. 
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I.  INTRODUCTION 


Artificial  neural  networks  have  been  studied  for  many  years  in  the  hope  of 
achieving  human-like  performance.  Neural  networks  consist  of  highly  connected  sets 
of  relatively  simple  processing  elements.  Computations  are  performed  collectively  by 
the  entire  network  with  the  activity  distributed  over  all  the  processing  elements.  This 
parallel  distributed  processing  provides  neural  networks  with  the  potential  to  solve 
complex  pioblems  more  quickly  than  the  currently  well  known  present  serial  methods. 
The  nonlinear  nature  and  simple  structure  of  neural  networks  provide  a  formalism 
for  the  study  of  nonlinear  signal  processing. 

The  application  of  neural  networks  to  signal  processing  involves  developing  an 
understanding  of  the  relationship  between  the  structure  of  a  neural  network  and  its 
ability  to  perform  the  desired  input-to-output  mapping.  A  neural  network’s  structure 
is  defined  by  the  number  and  type  of  processing  elements  in  the  network,  the  values 
of  the  weights  that  connect  the  processing  elements  together,  and  a  threshold  value 
associated  with  each  processing  element.  Past  work  has  lead  to  a  large  variety  of 
neural  network  models.  The  models  include  the  Hopfield  network,  the  single-  and 
multi-layer  pereeptvon  Jietworks,  the  reduced  Coulomb  energy  (RCE)  classifier ,  and 
the  adaptive  resonance  theory  (ART)  model  [Ref.  l:pp.  65-73].  Each  model  differs 
in  its  structure  and  the  manner  in  which  the  weights  and  thresholds  of  the  network 
are  derived.  One  current  method  for  calculating  the  weights  and  thresholds  of  a 
feedforward  multilayer  neural  network,  called  the  backpropagation  method,  uses  a 
steepest  descent  method  to  iteratively  adapt  the  weights  and  thresholds  of  the  network 
[Ref.  ‘2:p.  127].  This  method  has  generally  been  shown  to  be  slow  to  converge  to  the 
optimal  set  of  weights  and  thresholds  for  a  given  problem  [Ref.  l:p.  300].  The 
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objectives  of  this  thesis  research  were  therefore: 


•  Investigate  the  relationship  between  the  structure  of  a  neural  network  and  its 
ability  to  perform  input-output  mapping. 

•  Develop  an  alternative  to  the  backpropagation  method  that  converges  more 
quickly  to  the  optimal  set  of  weights  and  thresholds  for  any  given  problem. 

•  Compare  the  performance  of  a  neural  network  to  its  linear  counterpart  for  some 
representative  signal  processing  applications. 

Chapter  II  provides  a  general  overview  of  the  theory  of  neural  networks.  A 
graphical  approach  is  employed  to  demonstrate  the  ability  of  neural  networks  to 
perform  nonlinear  mapping  for  various  network  configurations.  The  results  are  then 
related  to  a  theorem  by  Kolmogorov.  The  backpropagation  method  for  calculating 
the  weights  and  thresholds  of  the  neural  network  is  also  introduced. 

Chapter  III  deals  with  the  derivation  of  an  alternative  algorithm  to  the  back- 
propagation  method  for  calculating  the  weights  and  thresholds  of  a  neural  network. 
The  conjugate  gradient  optimization  method  is  presented  and  then  applied  to  the  neu¬ 
ral  network  model.  The  Fibonacci  line  search  method  used  in  conjunction  with  the 
conjugate  gradient  method  is  also  discussed.  The  final  section  of  the  chapter  presents 
details  concerning  actual  implementation  of  the  algorithm  to  include  experimentally 
derived  parameters. 

Chapter  IV  presents  the  results  of  the  thesis  research.  The  conjugate  gradient 
algorithm's  performance  is  compared  to  the  backpropagation  method  and  is  shown 
to  be  more  computationally  efficient.  A  neural  network  using  the  conjugate  gradi¬ 
ent  algorithm  is  then  applied  to  three  simple  examples  to  validate  the  performance 
of  the  new  algorithm  and  to  demonstrate  the  types  of  tasks  that  a  neural  network 
can  perform.  The  first  example  illustrates  the  neural  network’s  ability  to  perform 
classification.  A  two  input  neural  network  is  successfully  "taught ”  to  differentiate  be¬ 
tween  sets  of  points  falling  inside  and  outside  a  circle.  The  second  example  compares 


the  performance  of  a  one-step  linear  predictor  to  a  neural  network  for  a  nonlinear 
chaotic  time  series  generated  using  the  Feigenbaum  logistic  function.  This  applica¬ 
tion  demonstrates  the  nonlinear  mapping  ability  of  the  neural  network.  The  neural 
network  predictor  is  shown  to  provide  much  greater  accuracy  than  its  linear  counter¬ 
part.  The  final  application  presented  demonstrates  the  ability  of  a  neural  network  to 
perform  channel  equalization  for  a  nonminimum  phase  channel.  Its  performance  is 
compared  to  its  linear  equivalent  and  is  shown  to  provide  superior  performance. 

Chapter  V  contains  the  overall  conclusions  of  the  thesis  research  and  provides 
recommendations  for  future  research. 


II.  FUNDAMENTALS  -  HOW  NEURAL 
NETWORKS  WORK 


A.  THE  BASIC  BUILDING  BLOCK 

A  neural  network  is  a  system  of  relatively  simple  processing  elements  whose 
function  is  determined  by  its  network  structure,  connection  weights,  and  the  transfer 
iunction  of  each  neuron.  Figure  2.1  shows  a  single  artificial  neuron,  the  fundamental 
building  block  for  all  neural  networks.  A  set  of  inputs  Xj,  £2, .  • . ,  x„  are  applied 
through  a  set  of  associated  connection  weights  W\,W2, . . .  ,wn  to  the  neuron. 


The  inputs  correspond  to  the  stimulation  levels  and  the  weights  to  the  synap¬ 
tic  strengths  of  a  biological  neuron.  The  neuron  sums  the  weighted  inputs,  adds  a 
threshold  value,  and  applies  the  result  to  the  neuron’s  transfer  function  /(x).  This 
operation  can  be  expressed  as 


or  in  vector  notation 


2  =  /  (wTx  -f  6 ) 


(2.2) 


where  x  is  a  column  vector  of  inputs,  w  the  corresponding  column  vector  of  weights, 
and  6  the  neuron’s  threshold  value. 

B.  THE  TRANSFER  FUNCTION 

A  number  of  possibilities  arise  for  selection  of  an  appropriate  transfer  function. 
These  include  most  notably:  the  signum  function,  the  linear  function,  and  the  sigmoid 
function.  Initial  research  conducted  in  the  1950’s  and  1960’s  by  Rosenblat,  Minsky 
and  others  used  the  signum  function  shown  in  Figure  2.2  [Ref.  3).  The  signum  function 
will  be  used  for  a  preliminary  discussion  of  how  neural  networks  operate. 


Artificial  neurons  using  the  signum  transfer  function  were  referred  to  as  percep- 
trons  [Ref.  3].  The  signum  transfer  function  causes  the  output  of  the  perceptron  to 
take  one  of  two  discrete  values.  The  point  at  which  the  neuron  switches  from  low  to 
high  or  high  to  low  is  determined  by  the  input  weights  and  the  perception's  thresh¬ 
old  value.  It  has  been  shown  that  a  single  perceptron  has  the  ability  to  distinguish 
between  two  classes  of  inputs  [Ref.  4:p.  13].  This  is  demonstrated  in  Figure  2.3  for  a 
two  input  network. 


5 


The  combination  of  weights  (w \  and  wz)  and  the  offset  (0)  define  a  line  where  the 
output  of  the  network  ( z )  is  high  for  the  class  of  inputs  falling  on  one  side  of  the  line 
and  low  for  the  second  class  of  inputs  falling  on  the  other  side.  If  there  are  n  inputs 
to  a  single  perceptron,  as  pictured  in  Figure  2.1,  the  perception  can  construct  an  n 
dimensional  hyperplane  separating  the  two  classes  of  inputs.  Input  classes  that  cannot 
be  separated  by  a  simple  hyperplane  therefore  cannot  be  accurately  differentiated  by 
a  single  perceptron. 

This  problem  can  be  remedied  by  cascading  the  perceptrons  into  several  layers. 
This  type  of  network  topology  is  called  a  feedforward  network  because  the  output 
from  the  previous  layer  is  fed  forward  to  only  the  neurons  in  the  next  layer  of  the 
network.  By  adding  additional  layers,  more  complex  boundaries  can  be  defined.  A 
two  layer  network  is  capable  of  defining  decision  regions  that  are  convex  or  concave 
in  shape.  For  the  two  input  case  shown  in  Figure  2.4,  each  perceptron  in  the  first 
layer  defines  a  boundary  line.  A  single  second  layer  perceptron  weights  and  combines 
the  outputs  from  the  first  layer  perceptrons  to  produce  the  two  decision  regions.  As 
pictured  in  Figure  2.4  a  two  layer  network  can  also  define  a  single  enclosed  region. 
With  the  addition  of  a  third  layer,  disjoint  enclosed  regions  can  be  combined  to  create 
a  decision  map  of  any  arbitrary  complexity,  given  a  sufficient  number  of  perceptrons 
in  each  layer.  This  is  illustrated  in  Figure  2.5. 

The  performance  of  a  multilayer  perceptron  network  using  the  signum  transfer 
function  is  satisfactory  provided  the  desired  output  from  the  network  is  limited  to  two 
discrete  values  (i.e.,  high  or  low).  This  would  be  appropriate  for  a  binary  classifier 
system,  where  each  output  would  represent  one  of  two  classes,  i.e.,  a  binary  value. 
It  does  not,  however,  provide  sullicient  resolution  for  analog  (continuously  valued)  or 
the  corresponding  discrete  valued  output  functions  associated  with  most  other  signal 
processing  applications. 
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Figure  2.3:  Single  neuron  and  associated  decision  regions 


Figure  2.4:  Two  layer  network  and  associated  decision  regions 


Figure  2.5:  Three  layer  network  and  associated  decision  regions 


One  example  of  a  transfer  function  that  would  be  capable  of  pioviding  such  a 
continuously  variable  output  is  the  linear  transfer  function.  In  this  case,  the  output  of 
the  artificial  neuron  would  simply  be  the  weighted  sum  of  the  inputs  plus  the  neuron’s 
threshold  value.  This  car  be  expressed  as 

5  =  /(x)  =  J2wixi  +  0  (2-3) 

i 

or  in  vector  notation 

z  =  /(x)  =  w7x  +  0.  (2.4) 

This  is  the  transfer  function  used  by  Widrow  and  Hoff  in  their  development  of  the 
adaptive  linear  (adaline)  and  multiple  adaptive  linear  (madaline)  filters  (Hof.  5:p. 
10].  A  great  deal  has  been  written  concerning  research  and  applications  of  the 
adaptive  linear  filter  although  it  has  not  often  been  referred  to  as  a  neural  model 
(Ref.  6], (Ref.  7], (Ref.  8].  One  key  feature  of  the  linear  neural  network  is  that  there  is 
no  functional  difference  between  a  multilayer  and  a  single  layer  network.  For  example, 
for  the  simple  two  layer  network  in  Figure  2.b  the  output  of  the  first,  layer  neurons 
can  be  written  as 


/i(*i 

,X2)  =  U>iZi  +  W2X2  +  0\ 

(2.5) 

and 

•  a'2)  =  iv3Xi  +  u\j.r2  +  02. 

(2.6) 

The  output  of  the  network  can  then  be  written  as 

Wsfl(xl,X2)  +  Wsf2(x  \,X2)  +  O3. 

(2.7) 

After 

some  algebraic  manipulation  and  substitution,  the  final  result  is 

=  «j5(u*i  +  U'3):i 

-1  +  w<i{  u-'2  +  tCj  }x2  +  (w$0\  +  W$02  +  0z). 

(2.8) 
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From  the  above  discussion,  it  is  clear  that,  regardless  of  the  number  of  layers  in  the 
network,  the  network  can  always  be  reduced  to  a  single  layer  netv.ork.  Essentially 
then,  the  linear  adaptive  filter  is  nothing  more  than  the  linear  version  of  a  single  layer 
neural  network. 

A  third  transfer  function  which  has  been  recently  popularized  by  Rumelhart  et 
al.  [Ref.  9]  is  called  the  sigmoid  function.  It  is  defined  by  the  equation 


/(*) 


1 

1  +  e~: 


(2.9) 


where 

S  =  YlWiXi  +  0  =  WTX  +  6.  (2.10) 

i 

The  sigmoid  function,  pictured  in  Figure  2.7,  has  a  shape  which  would  appear  to  fall 
somewhere  between  the  linear  transfer  function  and  the  signum  transfer  function. 

Its  output  is  limited  to  a  continuous  range  of  values  between  zero  and  one.  For  values 
of  s  near  zero,  the  transfer  function  behaves  in  a  linear  fashion  with  a  constant  slope 
of  one.  If  the  input  weights  to  the  neuron  are  kept  sufficiently  small  and  the  range 
of  input  values  limited,  the  sigmoidal  artificial  neuron  can  be  made  to  appear  linear. 
Likewise,  by  using  large  values  fur  the  input  weights  u>,  the  values  for  ;  would  vary 
more  rapidly  and  the  sigmoidal  artificial  neuron  would  more  closely  approximate  the 
signum  function.  As  a  result,  the  output  of  the  network  can  be  made  to  approximate 
both  linear  and  nonlinear  combinations  of  the  inputs  depending  on  the  values  of  the 
network’s  weights  (w)  and  thresholds  (0). 

A  theorem  developed  by  Kolmogorov  and  described  in  Reference  10  provides 
further  insight  into  'he  potential  capabilities  of  a  multilayer  sigmoidal  neural  network. 
The  theorem  states  that  any  continuous  function  of  n  variables  can  be  represented 
using  only  linear  summatio.  s  and  nonlinear  but  continuously  increasing  functions  of 
only  one  variable.  This  would  indicate  that  a  three  layer  artificial  neuron  feedforward 
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Figure  2.6:  Two  layer  linear  network 
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Figure  2.7:  Neuron  transfer  functions 
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network  using  a  sigmoidal  transfer  function  is  capable  of  representing  any  nonlin¬ 
ear  multivariable  function.  The  theorem,  however,  does  not  indicate  the  number  of 
neurons  required  in  each  layer,  or  how  the  values  for  the  weights  should  be  derived. 

It  has  been  suggested  that  one  approach  to  representing  an  n-r!unensional  non¬ 
linear  function  using  neural  networks  might  be  by  a  weighted  combination  of  n- 
dimensional  ‘bumps’  [Ref.  11].  This  is  somewhat  analogous  to  the  Fourier  series 
representation  of  an  arbitrary  signal  where  weighted  combinations  of  sinusoids  of 
suitable  frequencies  are  used.  To  see  how  a  nonlinear  function  might  be  represented 
using  a  sigmoidal  neural  network,  let  us  look  at  the  case  where  we  have  a  nonlinear 
function  of  two  variables.  The  output  of  the  nonlinear  function  could  be  interpreted 
as  a  two  dimensional  surface  in  a  three  dimensional  space.  The  output  of  a  single 
sigmoidal  neuron  would  have  a  surface  like  that  pictured  in  Figure  2.8. 


Figure  2.8:  A  sigmoid  surface 

The  orientation  of  the  rising  slope  of  the  sigmoidal  surface  is  determined  by  the 
neuron’s  input  weights  (w).  Its  position  is  determined  by  its  threshold  (0)  value. 
The  height  of  the  surface  is  controlled  by  the  weight  connected  to  the  omput  of  the 
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neuron.  If  we  add  a  second  neuron  with  the  same  orientation,  but  a  slightly  different 
position  than  the  first  by  using  a  different  threshold  value  (0),  and  use  an  output 
weight  equal  to  but  opposite  in  sign  of  the  first,  we  can  form  a  ridge  as  shown  in 
Figure  2.9. 


A  second  ridge,  perpendicular  to  the  first,  can  then  be  constructed  by  adding 
two  additional  neurons  to  the  first  layer  and  selecting  appropriate  input  weight  values. 
The  sum  of  the  two  ridges  then  forms  the  surface  pictured  in  Figure  2.10.  The  weights 
connecting  the  outputs  of  the  first  layer  neurons  to  the  single  second  layer  neuron 
along  with  the  second  layer  neuron’s  threshold  value  can  then  be  adjusted  to  yield  a 
true  bump  shown  in  Figure  2.11. 

We  can  now  represent  any  surface  as  a  combination  of  these  bumps.  The  network 
topology  to  accomplish  this  would  consist  of  multiple  copies  of  two  layer  network  and 
a  single  third  layer  neuron  to  weight  and  sum  the  bumps.  The  resulting  surface  is 
pictured  in  Figure  2.12.  The  preceding  development  provides  some  insight  into  the 
number  of  neurons  required  in  each  layer  of  a  neural  network  to  adequately  represent 
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a  given  nonlinear  function.  A  given  function  might  be  more  efficiently  represented 
using  a  combination  of  sigmoidal  surfaces  or  ridges  rather  than  bumps.  The  better 
knowledge  one  has  of  the  function  to  be  represented  will  lead  to  a  better  decision 
concerning  the  neural  network  topology  required. 

C.  CALCULATION  OF  WEIGHTS  AND  THRESHOLDS 

The  burning  question  that  has  yet  to  be  addressed  concerning  the  feedforward 
sigmoidal  neural  network  is  how  do  we  calculate  the  weights  (w)  and  the  neuron 
thresholds  (0)  of  the  network  to  yield  a  satisfactory  representation  of  a  given  nonlinear 
function.  A  method  called  backpropagation,  developed  by  Rumelhart,  has  proven 
popular  and  has  been  demonstrated  to  work  fairly  well  [Ref.  2].  The  backpropagation 
method  uses  a  training  data  set  consisting  of  sets  of  inputs  and  a  desired  output  value. 
A  set  of  inputs  is  applied  to  the  neural  network  and  the  resulting  network  output  is 
compared  to  the  desired  value.  The  error  between  the  neural  network’s  output  and 
the  desired  output,  along  with  the  current  state  of  neural  network,  is  used  to  modify 
the  neural  network’s  weights  and  threshold  values.  The  state  of  the  neural  network  is 
defined  by  the  current  input  to  the  network,  its  weights,  thresholds,  and  each  neuron’s 
transfer  function.  The  backpropagation  method  attempts  to  minimize  the  sum  of  the 
squared  errors  over  the  entire  training  data  set.  This  can  be  expressed  as 

£  =  2>2(i)  =  £(»(<)- -’W)2  (2-11) 

i  t 

where  E  is  the  total  squared  error,  e(t)  is  the  network  output  error  for  the  <th  input 
set,  y(t)  is  the  desired  or  target  output  for  the  <th  input  set,  and  z{t)  is  the  actual 
output  of  the  neural  net  for  the  tth  input  set.  The  weights  and  the  thresholds  of  the 
network  are  iteratively  updated  in  proportion  to  the  gradient  of  the  total  squared 
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error,  E.  This  can  be  expressed  as 


w(n  -f  1)  =  w(n)  +  At u(n)  =  w(n)  —  -  ■  —  -  •  e  (2.12) 

ou>(n) 

and 

cp 

%  +  1)  =  «(n)  +  A0(n)  =  «(n)  -  •  £  (2.13) 

where  w(n)  and  6(n)  are  the  weights  and  thresholds  at  the  nth  iteration  of  the  algo¬ 
rithm,  Aw(n)  and  A 6(n)  are  the  incremental  changes  to  the  weights  and  thresholds, 
and  e  is  the  proportionality  constant  [Ref.  2:p.  130].  The  backpropagation  method 
gets  its  name  from  the  fact  that  the  error  at  the  output  of  the  network  is  propagated 
back  through  the  network  in  the  form  of  gradients  in  order  to  update  the  network’s 
weights  and  thresholds. 

The  backpropagation  method  is  essentially  a  steepest  descent  optimization  al¬ 
gorithm  which  uses  the  gradient  of  the  squared  error  function  to  modify  the  weights 
and  thresholds  of  the  neural  network  [Ref.  2:p.  127].  One  requirement  dictated  by 
this  gradient  method  is  that  the  transfer  function  of  the  neurons  be  continuously  dif¬ 
ferentiable  [Ref.  2:p.  131].  As  a  result,  this  method  cannot  be  used  with  the  signum 
transfer  function  because  of  its  discontinuity.  The  method,  however,  does  work  for 
the  linear  and  sigmoidal  transfer  function  cases. 

As  presented  above,  the  weights  and  thresholds  are  updated  after  a  complete 
pass  of  the  entire  training  data  set  through  the  network.  In  the  actual  implemen¬ 
tation  of  the  algorithm,  however,  Rumelhart  updates  the  weights  anct  thresholds  of 
the  network  after  each  input/desired  output  pair  is  applied  [Ref.  2:pp.  136-137]. 
His  rationale  for  doing  this  is  that  the  algorithm  converges  so  slowly  that  it  does 
not  affect  the  overall  convergence  rate,  and  that  it  is  more  gratifying  to  update  the 
weights  and  thresholds  more  frequently  [Ref.  2:p.  137].  As  Rummelhart  indicated, 
the  steepest  descent  method  is  extremely  slow  to  converge.  Jt  was  this  deficiency  that 
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led  to  the  development  of  this  thesis  project.  Lapedes  and  Farber  indicated  that  a 
related  optimization  method,  the  conjugate  gradient  algorithm,  yielded  a  significant 
improvement  in  the  convergence  rate  of  the  backpropagation  method  [Ref.  12).  The 
following  chapter  will  address  the  development  and  application  of  this  optimization 
method  to  a  feedforward  sigmoidal  neural  network. 


III.  DERIVATION  OF  THE  ADAPTATION 

ALGORITHM 


A.  THE  CONJUGATE  GRADIENT  METHOD 
1.  General  Description 

The  conjugate  gradient  method  is  an  iterative  method  for  optimizing  a 
set  of  coefficients  h  in  order  to  minimize  a  given  objective  function  J(h).  It  falls 
into  the  class  of  optimization  methods  that  apply  a  multidimensional  search  using 
derivatives  to  the  optimization  problem  [Ref.  13:pp.  2S9-316).  The  steepest  descent 
method,  which  Rumelhart  uses  for  adapting  the  feedforward  neural  network,  is  also 
a  member  of  this  class  [Ref.  2].  This  class  of  optimization  methods,  called  gradient 
methods,  treat  the  objective  function  J(h)  as  a  multidimensional  surface  over  which 
it  iteratively  searches  for  the  absolute  or  global  minimum  [Ref.  13:pp.  289-316].  The 
coefficients  h  are  the  multidimensional  coordinates  which  define  where  the  algorithm 
is  located  on  the  surface  during  any  particular  iteration.  This  class  of  optimization 
methods  require  that  the  objective  function  be  differentiable  with  respect  to  the 
coefficients  h  that  are  adapted  [Ref.  13:p.  289].  This  partial  derivative  is  called  the 
gradient  g  of  the  objective  function.  When  evaluated  for  a  given  set  of  coefficients  h, 
the  gradient  g  is  a  multidimensional  vector  which  is  tangent  to  the  objective  function 
surface  at  a  point  defined  by  the  coefficients  h.  This  vector  points  in  the  direction  of 
greatest  increase.  The  negative  of  the  gradient  (-g)  logically  points  downhill  in  the 
direction  of  -freat?st  decrease.  Thus,  the  gradient  vector  g  can  provide  a  direction 
along  the  surface  of  the  objective  function  in  which  to  search  for  the  global  minimum. 
The  advantage  of  gradient  methods  is  that  they  decompose  the  optimization  problem 
from  a  multidimensional  search  of  the  objective  function  surface  to  a  sequence  of  line 
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searches  along  directions  determined  by  the  gradient  vector  g. 

The  method  of  steepest  descent  uses  the  gradient  vector  g  directly  to  per¬ 
form  its  iterative  line  search  of  the  objective  function  surface  [Ref.  14:pp.  214-220]. 
Rumelhart  points  out  that  the  steepest  descent  method  works  well  when  the  objective 
function  surface  is  quadratic  or  bowl-shaped  with  a  single  global  minimum  [Ref.  2:p. 
132].  He  states,  however,  that  the  more  complex  objective  function  surfaces  associ¬ 
ated  with  multilayer  neural  networks  frequently  contain  many  local  minima  [Ref.  2:p. 
132].  As  a  result,  the  steepest  descent  method  can  become  trapped  in  one  of  these 
local  minima  yielding  a  less  than  optimal  solution.  This  is  because  the  magnitude  of 
gradient  vector  decreases  as  the  algorithm  approaches  a  local  minimum.  The  distance 
the  steepest  descent  algorithm  travels  for  a  given  iteration  is  a  function  of  a  constant 
times  the  magnitude  of  the  gradient.  Therefore,  as  the  magnitude  of  the  gradient  de¬ 
creases,  the  distance  the  algorithm  travels  along  the  surface  decreases.  Compounding 
the  problem  is  the  fact  that  each  successive  gradient  is  orthogonal  to  the  previous 
gradient.  This  causes  the  algorithm  to  zigzag  in  ever  smaller  steps  as  it  approaches 
the  bottom  of  a  local  minimum.  The  result  is  that  the  algorithm  becomes  trapped 
at  the  bottom  of  a  local  minimum  and  never  reaches  the  optimal  point  or  global 
minimum.  Use  of  a  constant  stepsize  also  causes  the  steepest  descent  algorithm  to  be 
extremely  slow  to  converge  [Ref.  13:pp.  290-291]. 

The  conjugate  gradient  approach  is  motivated  by  a  desire  to  accelerate 
the  convergence  rate  of  the  steepest  descent  method  without  greatly  increasing  the 
complexity  of  the  algorithm.  The  conjugate  gradient  method  uses  a  succession  of 
direction  vectors  d/,.  that  are  conjugate  to  the  gradient  vector  g;.  obtained  as  the 
algorithm  progresses.  The  direction  along  which  the  algorithm  searches,  d/.-,  is  a 
linear  combination  of  present  and  past  values  of  the  gradient  vector.  The  result  is 
that  the  gradient  vector  g*,.  is  orthogonal  to  the  subspace  I\.  which  is  defined  by 
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the  set  of  all  previous  direction  vectors  d0,di,. . .  ,d*_i.  Each  successive  iteration 
essentially  adds  an  additional  dimension  to  the  subspace  IV  The  distance  a k  that 
the  algorithm  travels  along  the  line  search  direction  d*  also  varies  for  each  iteration 
of  the  algorithm.  This  makes  the  method  only  slightly  more  complicated  than  the 
steepest  descent  method.  The  algorithm,  however,  does  not  become  trapped  in  local 
minima  as  easily  as  the  steepest  descent  method  and  converges  steadily  to  the  global 
minimum  or  optimal  set  of  coefficients  h*  [Ref.  13:pp.  297-316]. 

2.  Notation  Summary 

The  notation  used  to  describe  the  conjugate  gradient  method  is  as  follows: 

J(h)  Objective  function  to  be  minimized, 
h/t  Coefficient  vector  at  the  kth  iteration, 
g k  Gradient  vector  of  the  objective  function  at  the  A*'1'  iteration, 
d*  Search  direction  vector  at  the  A4*1'  iteration. 
atk  Search  distance  coefficient  at  the  fcth  iteration. 
fik  Deflection  coefficient  at  the  A:th  iteration. 

3.  Sum  ..ary  of  the  Conjugate  Gradient  Algorithm 

A  summary  of  the  conjugate  gradient  method  for  minimizing  a  differen¬ 
tiable  objective  function  J( h)  is  listed  below  [Ref.  13:p.  306]: 

Step  1.  Choose  an  initial  set  of  coefficients  ho. 

Step  2.  Calculate  the  initial  gradient.  g0  using  the  definition 


Step  3.  Let  the  initial  direction  vector  be  do  =  —go. 

Step  4.  Let  k=0. 

Step  5.  Let  a *  be  the  optimal  solution  to  the  problem  to  minimize  J( h*  + 
ctfcd*)  subject  to  a*  >  0. 

Step  6.  Update  the  new  coefficients  h*+i  using  the  equation 

hjt+1  =  h  k  +  a*d*.  (3.2) 

Step  7.  Calculate  the  next  gradient  vector  value  g*,.+1  using  the  new  coeffi¬ 
cients  hjt+j. 

Step  8.  Calculate  the  deflection  coefficient  /?;.  using  the  equation 

_  (gfc~n  gfc)  gr-+i  ^ 

' *  ~  g*g k 

Step  9.  Update  the  direction  vector  d*+i  using  the  equation 

d/r+l  =  —  gjtr+l  +  Pkdk-  (3.4) 


Step  10.  Replace  k  by  k  +  1  and  go  to  step  5. 

4.  Selection  of  a  Line  Search  Method 

The  conjugate  gradient  method  outlined  above  requires  that  a  search  dis¬ 
tance  coefficient  a*  be  found  that  minimizes  the  objective  function  J(hk  +  Ofcd*) 
subject  to  dk  >  0.  This  dictates  that  a  line  search  be  performed  starting  at  the  point 
in  multidimensional  space  defined  by  the  current  coefficient  vector  h*.  and  proceeding 
along  the  line  defined  by  the  current  direction  vector  d*  until  the  minimum  value  of 
the  objective  function  is  found.  The  distance  the  line  search  algorithm  travels  from 
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the  point  h*  to  the  minimum  value  of  the  function  is  then  defined  to  be  the  scalar 
value  a*.  A  number  of  methods  have  been  proposed  to  perform  this  line  search. 
These  include  the  uniform  search,  dichotomous  search,  the  golden  section  method, 
and  the  Fibonacci  method  [Ref.  13:pp.  253-264].  There  is  also  a  class  of  line  search 
methods  which  use  derivatives  to  assist  in  finding  the  minimum  value  of  the  objective 
function  [Ref.  13:pp.  264-269].  This  second  group  of  methods  was  considered  for 
use  with  the  conjugate  gradient  method  but  were  subsequently  rejected  due  to  the 
complexity  of  calculating  and  evaluating  the  required  derivatives.  The  selection  of 
an  appropriate  line  search  method  for  use  in  conjunction  with  the  conjugate  gradient 
method  was  based  primarily  on  efficiency.  All  of  the  methods  except  for  the  Fibonacci 
search  require  two  evaluations  of  the  objective  function  during  each  iteration  of  the 
algorithm.  The  Fibonacci  method,  however,  requires  only  a  single  evaluation  because 
it  also  uses  the  results  from  the  previous  iteration.  Comparison  of  the  line  search 
methods  mentioned  above  revealed  that  the  Fibonacci  search  method  is  the  most 
efficient  [Ref.  13:p.  264].  As  a  result,  the  Fibonacci  search  method  was  chosen  to  be 
used  in  conjunction  with  the  conjugate  gradient  method. 

The  Fibonacci  method  performs  a  search  for  the  minimum  value  of  a  func¬ 
tion  of  a  single  variable  over  a  closed  bounded  interval  [a,  6].  The  function  in  this 
case  is  J(hk  +  Qfcjdjt)  where  is  the  single  variable.  The  interval  over  which  the 
algorithm  searches  is  called  the  interval  of  uncertainty  and  limits  the  range  of  values 
for  ctkj.  The  lower  limit  for  Q;v  is  given  by  the  conjugate  gradient  method  as  zero, 
but  the  upper  limit  must  be  specified  before  the  algorithm  can  begin.  The  interval  of 
uncertainty  is  steadily  reduced  as  the  algorithm  progresses.  The  number  of  iterations 
which  the  algorithm  will  perform  must  also  be  specified  before  the  start  of  the  algo¬ 
rithm.  The  Fibonacci  method  is  based  on  the  Fibonacci  sequence  Fv  which  is  defined 


as 


u+i  =  Fv  +  Fu_  i 

(3.5) 

$ 

II 

ll 

►— * 

(3.6) 

The  resulting  sequence  is  1,1,2,3,5,8,13,21,34,55,89,....  The  Fibonacci  search 
method  begins  by  evaluating  the  objective  function  at  each  of  two  points  within  the 
interval  of  uncertainty  as  shown  in  figure  3.1. 

These  two  points,  which  we  will  call  Xj  and  are  calculated  using 


and 


Xj  =  aj  +  -—-{bj  -  aj) 

-Tn-j+l 


(3.7) 


N  =  a,  + 


Fn-j 

Fn—j+l 


( bj-aj ) 


(3.8) 


where  k  is  the  iteration  index  of  the  conjugate  gradient  algorithm,  j  is  the  iteration 
index  of  the  Fibonacci  algorithm,  (a_,,  is  the  current  interval  of  uncertainty,  and  n 
is  the  total  number  of  iterations  planned.  A  new  interval  of  uncertainty  [«J+], 5J+i] 
is  then  selected  based  on  the  value  of  the  objective  function  at  the  two  points  X}  and 
If  J(h*  +  Xjdk)  >  J{hk  +  fijdk),  then  the  new  interval  of  uncertainty 
is  given  by  [Aj,6j],  Likewise,  if  the  opposite  is  true,  J( h*  +  Ajd/;)  <  J(hk  +  fijdk ), 
then  the  new  interval  of  uncertainty  is  [aj,fij].  Both  cases  are  shown  in  Figure  3.2. 
The  key  feature  that  makes  the  Fibonacci  method  so  attractive  is  that,  for  the  next 
iteration  j  +  1,  either  =  n}  or  fiJ+\  =  A;,  depending  on  which  new  interval  of 
uncertainty  was  selected.  Since  the  objective  function  has  already  been  evaluated  at 
the  previous  values  for  Xj  and  nv  then  only  one  additional  evaluation  must  be  made 
for  each  succeeding  iteration.  At  the  completion  of  the  specified  n  iterations  of  the 
algorithm,  the  size  of  the  final  interval  of  uncertainty  will  be 


(6„  -  a„)  = 


(bp  -  «o) 

F„ 


•>■> 


(3.9) 


Figure  3.1:  Initial  evaluation  points  A0  and  /<0  and  interval  of  uncertainty 


Figure  3.2:  Evaluation  points  AJ+i  and  Hj+X  and  revised  interval  01  uncer¬ 
tainty  when  J(Xj)  >  J(/tj) 


Figure  3.3:  Evaluation  points  AJ+i  and  fij+x  and  revised  interval  of  uncer¬ 
tainty  when  J{Xj)  < 
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If  we  select  the  midpoint  of  the  final  interval  of  uncertainty  as  the  value  «/.„  to  be  used 
by  the  conjugate  gradient  method,  then  we  can  calculate  the  number  of  iterations  n 
required  to  achieve  a  desired  accuracy  after  deciding  upon  an  upper  bound  60.  The 
upper  bound  and  number  of  iterations  used  for  the  neural  network  problem  will  be 
presented  in  the  next  chapter. 

5.  Calculation  of  the  Deflection  Coefficient  fa 

The  equation  used  to  calculate  the  deflection  constant  fa  (equation  3.3) 
is  the  Polak-Ribiere  version  of  the  conjugate  gradient  method  originally  proposed  by 
Fletcher  and  Reeves  [Ref.  14:p.  253].  The  original  method  used  the  equation 

fa  =  (3.10) 

git  g* 

to  calculate  the  deflection  constant  fa.  The  two  equations  are  equivalent  if  the  ob¬ 
jective  function  to  be  minimized  is  quadratic.  Experimental  results,  however,  tend 
to  indicate  that  the  Polak-Ribiere  method  is  more  effective  for  nonquadratic  objec¬ 
tive  functions  [Ref.  14:p.  254].  This  is  because  the  Polak-Ribiere  method  tends  to 
reset  the  the  direction  vector  dfc+1  to  the  value  of  the  gradient  vector  g*.+1  when 
two  successive  gradients  g*.  and  g*+i  are  equal.  This  has  the  effect  of  beginning  the 
conjugate  gradient  method  anew,  using  the  present  coefficients  vector  h*  as  the  new 
initial  coefficient  vector  ho- 

B.  APPLYING  THE  CONJUGATE  GRADIENT  METHOD  TO  A  NEU¬ 
RAL  NETWORK 

1.  The  Neural  Network  Model  and  Notation 

The  generic  neural  network  model  to  be  used  for  the  purposes  of  discussion 
is  pictured  in  Figure  3.4.  The  notation  used  when  referring  to  the  various  variables 
of  the  model  is  as  follows: 
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Xij  The  jth  input  to  the  ith  layer  of  the  network.  For  other  than  the  inputs 
£oi>z<)2> •••>£()/,  the  variable  is  also  the  output  of  the  jth  neuron  in  the 
(i  —  l)lh  layer  and  is  a  function  of  the  previous  layer’s  inputs  and  weights  and 
the  jlh  neuron’s  threshold  value. 

Wijk  The  weight  in  the  ith  layer  of  the  network  that  connects  the  jlh  input  to  the 
fcth  neuron  of  the  layer. 

6ik  The  threshold  value  associated  with  the  klh  neuron  of  the  ith  layer  of  neurons. 
y  The  desired  output  value  of  the  network  for  a  given  set  of  inputs  a'0i,  ®02>  •  •  •  >  *o /• 

/(•)  The  transfer  function  of  the  neuron. 

2.  The  Neural  Network  Objective  Function  J(h) 

As  was  mentioned  in  the  previous  chapter,  we  wish  to  minimize  the  total 
sum  of  the  squared  errors  over  an  entire  training  data  set.  As  a  result,  the  objective 
function  J{ h)  to  be  minimized  using  the  conjugate  gradient  method  is 

E  =  Y.\^)  (3-11) 

t  Z 

where  e(t)  is  the  error  between  the  actual  and  the  desired  outputs  of  the  neural 
network  for  the  tlh  data  set. 

3.  The  Adaptation  Coefficients  h 

There  are  two  quantities  that  we  wish  to  adapt  in  order  for  the  neural 
network  to  consistently  produce  the  desired  output  for  a  given  input.  These  two 
quantities  are  the  connection  weights  wtJk  of  the  network  and  the  threshold  values  0th 
associated  with  each  neuron  in  the  network.  Together,  these  two  sets  of  coefficients 
form  the  coefficient  vector  h.  The  conjugate  gradient  algorithm  uses  a  single  vector 
h  to  represent  the  coefficients  which  are  adapted  to  minimize  the  objective  function 
J(h).  The  notation  used  for  the  neural  network  model,  however,  reflects  the  use  of 
matrices  for  the  weights  and  vectors  [0,*]  for  the  thresholds.  This  was  done  to 
simplify  the  identification  of  the  various  weights  and  thresholds.  We  must  therefore 


combine  and  transform  the  weight  matrices  and  threshold  vectors  into  a  single  vector 
h  in  order  to  apply  the  conjugate  gradient  algorithm.  This  is  done  by  assigning  the 
individual  weights  and  thresholds  to  a  vector  as  shown  in  equation  3.12. 

<yr  (3.i2) 

We  can  perform  the  conjugate  gradient  algorithm  using  the  vector  notation  and  then 
perform  a  reverse  transformation  at  the  completion  of  the  algorithm  to  assign  the 
final  weights  and  threshold  values  to  the  neural  network. 

4.  The  Gradient  Vector  g 

The  gradient  vector  g  used  by  the  conjugate  gradient  method  is  defined  as 

g  =  ^J(h).  (3.13) 

The  gradient  vector  g  for  the  neural  network  problem  consists  of  the  gradients  asso¬ 
ciated  with  the  weights  and  thresholds  of  the  neural  network.  The  gradient  vector  g 
would  be  of  the  form 

8  —  [Sol  1 1  fjovi'  •  •  •  >  (Joint'.  (Jll  1 5  •  •  •  5  ^3i  [JO o]  i  i JOoi  •  •  ■  • )  t lOj  ]  •  (3.14). 


The  gradient  for  any  particular  weight  or  threshold  of  the  network  is  calculated  by 
taking  the  partial  derivative  of  the  error  function  E  with  respect  to  the  particular 
weight  {wijk)  or  threshold  (0,*).  For  the  gradient  associated  with  a  weight  this  would 
be  expressed  as 


(Ji]k  — 


dE  1  d 


0tv,jk  2  dwijk 
and  for  the  gradient  associated  with  a  threshold  as 


iy«) 

L  1 


(3.15) 


[Jijk  — 


dE 

00ik 


1  0 
2  dOlk 


X>2(0 


1 1 


(3.16) 


The  partial  derivative  in  equations  3.15  and  3.16  can  be  moved  inside  the  respective 
summation  terms  resulting  in  the  following  expressions 


_  &E  _  1  ^  d  2 
9,jk  dw^  2  ,  dwijk 6 


(3.17) 


_  dE  _  1  ^  d  2 

9ijle  ~  ddik  ~  2  V  d6ike  (  )‘ 


(3.18) 


The  gradient  for  each  weight  toy*  can  therefore  be  expressed  as  the  sum  of  the  partial 
gradients 

m  =  Earn  (0  (3-W) 

t 

where  the  partial  gradient  g[jk(t)  is  the  gradient  associated  with  the  weight  w,jk  when 
evaluated  for  a  single  set  of  training  data  rather  than  the  entire  training  data  set.  The 
gradients  associated  with  the  threshold  values  of  the  neural  network  can  be  expressed 


in  a  similar  manner,  given  by 


9e,k  =  X>U*)- 


(3.20) 


For  the  purposes  of  notational  brevity,  we  will  assume  that  the  training  data  set 
consists  of  only  one  set  inputs  and  the  associated  desired  output.  This  will  allow 
us  reduce  the  length  of  equations  for  the  gradient  by  removing  references  to  the 
particular  element  of  the  training  set  used.  The  reader  should  remember,  however, 
that  if  there  are  s  pairs  of  data  in  the  training  set,  then  the  gradient  is  the  sum  of 
the  s  partial  gradients  as  expressed  in  equation  3.19  and  equation  3.20. 
a.  Neuron  Transfer  Function  Derivative 

Before  delving  into  the  derivation  of  the  equations  for  the  gradients 
of  the  weights  and  thresholds  of  the  neural  network,  a  few  comments  should  be  made 
concerning  the  transfer  function  used  for  the  neural  network  model  and  its  derivative. 
The  transfer  function  to  be  used  is  the  sigmoid  function  defined  by  equation  2.9 


in  Chapter  2.  A  key  feature  of  the  sigmoidal  function  is  that  its  derivative  can  be 
expressed  in  terms  of  its  original  value  by 

^/(*)  =  -/(*)  (!-/(*))•  (3-21) 

The  derivative  of  a  neuron’s  output  can  thus  be  expressed  as  a  function  of  the  output 
of  the  neuron  and  the  partial  derivative  of  the  neuron’s  inputs.  The  partial  derivative 
of  the  neuron’s  output  with  respect  to  is  then  given  by 

~d^t =  (_x,'+u)(1  “  Xi+l^d^~,  {9ik  "SWp)  (3-22) 

and 

=  (-*I+Ii)(l  -*.-+1*)^  («»  -  Ev.  (3.23) 

for  the  derivative  with  respect  to  0,*.  Equations  3.22  and  3.23  will  be  used  frequently 
to  evaluate  the  partial  derivatives  of  each  neuron’s  output  when  deriving  the  equations 
for  the  gradients  of  the  neural  network. 

b.  Calculation  of  the  Third  Layer  Gradient 

The  calculation  of  the  gradients  for  each  weight  and  threshold  of  the 
neural  network  begins  at  the  output  of  the  neural  network  where  the  difference  be¬ 
tween  the  actual  network  output  and  the  desired  output  produces  an  error.  This 
error  is  propagated  back  through  the  network  in  the  form  of  gradients.  The  gradient 
associated  with  the  output  weight  w3  can  be  expressed  as 


03  = 


dE 

dw3 


d  1  2  d  1  2 

to  2e 


(3.24) 


where  1U3.T3  is  the  output  of  the  network  and  y  is  the  desired  output  value.  Taking 
the  partial  derivative  yields 


93  =  (y  ~  W3.T3)  - —  {y  -  UJ3X3)  =  {y-  w3x3)  (-.r3) . 
dw3 


(3.25) 
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After  rearranging  the  terms  of  equation  3.25,  the  final  form  for  the  output  weight’s 
gradient  g$  becomes 

93  =  (w3x3  -  y)  x3.  (3.26) 


c.  Calculation  of  the  Second  Layer  Gradients 

Derivation  of  the  input,  first  and  second  layer  gradients  is  somewhat 
more  involved  than  that  of  the  third  layer  gradients  because  of  the  multiple  neurons 
and  weights  between  the  error  at  the  output  and  the  gradient  for  which  we  are  deriving 
an  expression.  The  gradient  equation  for  a  weight  in  the  second  layer  can  be  expressed 
as 


^  pi  ^ 

92]  =  a —  =  {y-  W3X3)  ~ - ( y  -  W3.T3)  •  (3.27) 

OlV  2  j  UXO2  j 

Of  the  terms  evaluated  by  the  partial  derivative,  only  the  output  of  the  third  layer 


neuron  3-3  is  affected  by  a  variation  of  the  second  layer  weight  w2j.  The  desired  output 
y  can  be  eliminated  and  the  partial  derivative  shifted  to  the  right  of  t  lie  output  weight 
term  w3.  This  yields  the  expression 


</2j  =  ( y  -  w3x3)  (~iv3)  - —  (.r3) . 

OW2j 


(3.28) 


We  can  replace  the  partial  derivative  term  in  equation  3.28  with  an  equivalent  ex¬ 
pression  that  can  be  evaluated  with  respect  to  u^j  using  equation  3.22.  1'liis  results 
in  the  following  expression 


92]  =  iV  ~  W3X3){-X3)  {-W3)  (1  -  3-3)  (02  -  Y,  W2p3-2p^J  ■  (3.29) 

Comparing  the  first  part  of  equation  3.29  with  equation  3.26,  we  find  that  we  can 
replace  the  first  two  terms  of  equation  3.29  with  the  output  weight's  gradient  g3. 
After  taking  the  partial  derivative,  only  one  term,  x2j,  remains.  The  equation  for  the 
second  layer  weight  gradient  becomes 


92]  =  93  (-U)3)  (1  -  3-3)  (-.«•  2j)  =  </3'-'3  ( 1  -  -r3)  .1  v  (3.30) 
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We  can  see  from  equation  3.30  that  the  gradient  g2j  is  a  function  of  weight  W3  that 
connects  the  neuron’s  output  to  the  next  layer,  the  gradient  g3  that  is  associated  with 
the  output  weight,  the  neuron’s  output  value  £3,  and  the  input  x2j  that  is  applied  to 
the  weight  for  which  we  are  calculating  the  gradient  (g2j).  This  relationship  between 
the  inputs,  outputs,  weights  and  gradients  will  be  found  to  be  consistent  for  each  of 
the  gradients  of  the  neural  network. 

Rather  than  starting  from  scratch  to  derive  the  equation  for  the  gra¬ 
dient  associated  with  the  output  neuron’s  threshold  02 ,  we  begin  at  the  point  where 
evaluation  of  the  partial  derivative  with  respect  to  02  differs  from  that  for  the  weight 
gradient  g2j.  The  equation  for  the  gradient  of  the  output  neuron’s  threshold  becomes 


96,  =  93  ( -  W3)  ( 1  -  S3)  ^  ^2  -  U!2pX2^j  .  (3.31 ) 

Evaluation  of  the  partial  derivative  yields  a  constant  of  one  since  none  of  the  sum¬ 
mation  terms  is  a  function  of  the  threshold  value  02.  Shifting  the  sign  term,  the  final 
form  for  the  gradient  is 

96,  =  "9^  3(1 -s3)  (3.32) 


Note  that  the  equation  for  the  gradient  of  the  neuron’s  threshold  value  02  (equation 
3.32)  has  the  same  form  as  that  for  the  input  weights  w2j  connected  to  the  outpu* 
neuron  (equation  3.29)  except  for  the  input  term  x2}.  We  can  treat  the  threshold 
value  as  a  weight  if  we  assume  that  the  threshold  ‘weight’  has  a  constant  input  of  -1. 
d.  Calculation  of  the  First  Layer  Gradients 

The  derivation  of  the  equation  for  the  gradient  of  the  first  layer  weights 
follows  in  a  similar  fashion  to  that  of  the  second  layer.  We  begin  at  the  point  where 


layer  weight  gradient  becomes 

9ijk  =  =  93  {~w3)  (1  -  *3)  (^2  ~  £  w2px2p j  .  (3.33) 

Only  the  output  of  the  A;th  neuron  in  the  second  layer  (x2 k)  is  affected  by  the  value  of 
the  weight  w\jk  of  the  first  layer.  Therefore  all  terms  except  for  the  klh  term  of  the 
summation  in  equation  3.33  are  zero  when  the  partial  derivative  is  taken.  This  yields 


the  expression 


9ijk  =  -93^3  (1  ~  ar3)  (— w2k)  -x - *a*. 

dwijk 


(3.34) 


Using  equation  3.22  we  can  rewrite  equation  3.34  as 

9ijk  =  -93W3  ( 1  -  0:3 )  ( ~ x2k )  ( - w2k )  (1  -  x2k)  ---  -  ]T  w .  (3.35) 

32k  q 

The  first  part  of  equation  3.35  can  be  replaced  with  the  gradient  glk  using  equation 
3.30.  Only  the  jth  term  of  the  summation  under  evaluation  by  the  partial  derivative 
with  respect  to  W\jk  is  nonzero.  The  equation  for  the  first  layer  gradients  of  the 


weights  then  becomes 


9ljk  =  92k  (~W2k)  (1  -  x2k)  (— .r,j) 


(3.36) 


which  when  rearranged  yields 


<lijk  =  92kW2k(l  -  X2k)  Xij. 


(3.37) 


Again,  the  present  layers’s  gradient  is  a  function  of  the  next  layer’s  gradients  and 
weights,  the  present  layer's  neuron  output  values,  and  the  input  to  the  present  layer. 

The  derivation  of  the  equation  for  the  gradient  associated  with  the 
neuron  thresholds  of  the  first  layer  follows  in  the  same  manner  as  that  of  the  second 
layer.  The  equation  for  the  threshold  gradients  92k  of  the  first  layer  can  be  expressed 


96n  =  92k  (-«•;*)  (1  -  x2k)  (o]k  -  £  wlllkxu)j 


(3.38) 


Evaluating  the  partial  derivative  results  in  the  final  equation 

9elk  =  -92kU*2k  (1  ~  xik)  (3.39) 

which  has  the  same  form  as  equation  3.32. 

e.  Calculation  of  the  Input  Layer  Gradients 

Derivation  of  the  input  layer’s  gradient  equation  differs  only  slightly 
from  the  previous  development.  The  difference  is  due  to  the  fact  that  a  variation  in 
the  value  of  a  weight  in  the  input  layer  affects  the  output  of  more  than  a  single  neuron 
in  the  next  layer  of  the  network.  This  means  that  we  must  retain  a  summation  term 
throughout  the  calculation  of  the  first  layer’s  gradient  equation.  The  gradient  for  the 
first  layer  weight  can  be  expressed  as 

dE  d  (  \ 

=  a^-k  ^-E«***J  •  <«°) 

The  threshold  02  is  not  a  function  of  the  input  layer’s  weights  and  is  eliminated  when 
its  partial  derivative  is  taken  with  respect  to  the  input  weight  The  other  terms 
under  evaluation  by  the  partial  derivative  (i.e.,  x2p)  are  all,  however,  a  function  of 
the  input  weight  ui0j*.  The  partial  derivative  can  be  moved  inside  the  summation 
resulting  in 

d 

9ojk  =  93^3  (1-  ar3)  2  W2p- - x2p.  (3.4 1 ) 

p  Owojk 

Shifting  the  summation  to  the  far  left  and  evaluating  the  partial  derivative  using 
equation  3.22  yields 

0 

dw0jk 

The  9\p  term  in  equation  3.42  can  be  eliminated  since  it  is  not  a  function  of  t%;.. 
The  remaining  terms  can  then  be  rearranged  to  produce 


9ojk  =  Y,93w*  (1  ~  *3)  uj2 p  (-a'2p)  (1  -  x2p) 


0\p  'y  ^  IL'iqkX 
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(3.42) 


The  value  g2p  can  be  substituted  for  the  first  part  of  equation  3.43  using  equation 
3.30.  Also,  the  output  of  the  kth  neuron  of  the  input  layer,  Xik,  is  a  function  of  the 
input  weight  wojk.  As  a  result,  evaluating  the  partial  derivative  using  equation  3.22 
results  in  the  equation 

d 


9ojk  =  ^92pW2p  (1  -  x2p)  Wlkp  {-xlk)  (1  -  xlk) 


dwi 


'0  jk 


WorkXQr  •  (3.44) 


Evaluating  the  partial  derivative  in  equation  3.44  with  respect  to  w0jk  we  find  that 
only  the  jih  term  of  the  summation  is  nonzero.  Rearranging  the  terms  yields 

9ojk  =  Y^92pW2P(  1  -  x2 p)  xik  wikp  (1  -  xn)  x0].  (3.45) 


aikp 


(!-*!*)• 


(3.47) 


Finally,  we  can  replace  the  first  four  terms  of  equation  3.45  with  the  value  gxkp  using 
equation  3.37.  This  results  in  the  equation  for  the  gradients  of  the  weights  of  the 
input  layer  of  the  network 

9ojk  =  ^gikpWik^J  (1  -  xu)  xoj.  (3.46) 

Using  the  same  reasoning  used  to  derive  equations  3.32  and  3.39  we  can  express  the 
equation  for  the  gradient  of  the  input  layer  neuron  thresholds  as 

9<>ok  ~  H  ~9lkpW\kp 
\  v  ) 

Derivation  of  the  equations  for  the  gradients  associated  with  the  weights  and  thresh¬ 
olds  of  the  neural  network  is  now  complete.  What  we  have  found  is  that  the  gradients 
for  any  particular  layer  of  the  network  can  be  expressed  as  a  function  of  the  given 
layer’s  weights,  thresholds,  inputs,  outputs,  and  the  following  layer’s  gradients.  It  is 
not  necessary  to  begin  at  the  output  of  the  network  and  use  the  network  output  error 
e{t)  to  calculate  the  gradient  for  a  particular  weight  or  threshold  which. is  several  lay¬ 
ers  back  in  the  network.  The  above  expressions  for  the  gradient  do.  however,  dictate 
that  the  gradient  calculations  begin  at  the  output  of  the  network  and  the  gradients 
be  propagated  back  through  the  network. 


5.  Fibonacci  Line  Search  Parameters 

Several  parameters  associated  with  the  Fibonacci  line  search  methods  must 
be  specified  before  the  conjugate  gradient  algorithm  described  in  this  chapter  can  be  ' 
applied.  These  parameters  are: 

•  The  initial  size  of  the  interval  of  uncertainty 

•  The  number  of  iterations  that  the  line  search  should  perform. 

The  Fibonacci  line  search  attempts  to  find  the  best  stepsize  (a*)  in  which  to  step 
along  the  error  function  surface  towards  the  global  minimum  in  a  direction  defined 
by  the  direction  vector  (d*).  The  initial  interval  of  uncertainty  is  the  interval  over 
which  the  algorithm  will  search  for  the  optimal  stepsize  (cu-).  The  initial  interval, 
therefore,  establishes  the  minimum  and  maximum  stepsize  values.  Our  goal  is  to  find 
the  optimal  set  of  weights  and  thresholds  by  moving  steadily  down  the  error  function 
surface  towards  the  global  minimum.  The  lower  bound  of  the  interval,  or  minimum 
stepsize  value,  is  therefore  zero  since  a  negative  value  would  move  the  algorithm  up  the 
error  function  surface  in  a  direction  opposite  the  direction  vector  (d*.).  Selection  of  an 
upper  bound  for  the  interval  entails  a  number  of  tradeoffs,  A  larger  maximum  value 
would  allow  the  algorithm  to  search  over  a  greater  interval  for  the  optimal  stepsize 
(at).  This  could  allow  the  conjugate  gradient  algorithm  to  converge  to  the  global 
minimum  more  quickly  by  enabling  it  to  step  farther  down  the  error  function  surface 
at  each  iteration  of  the  algorithm.  It  could  also  possibly  provide  more  protection 
against  being  trapped  in  a  local  minimum  by  allowing  the  line  search  algorithm  to 
search  beyond  the  confines  of  a  local  minimum.  A  larger  interval,  however,  requires 
that  a  greater  number  of  iterations  be  performed  to  reduce  the  interval  of  uncertainty 
to  the  required  degree.  This  final  interval  o.  mcertainty  must  be  small  so  that 
midpoint  of  the  interval  is  reasonably  close  to  the  optimal  stepsize  value.  It  is  this 
midpoint  that  is  the  stepsize  value  a*,  that  will  be  used  b\  the  conjugate  gradient 
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algorithm  to  update  the  weights  and  thresholds  of  the  neural  network.  A  larger  final 
interval  of  uncertainty  increases  the  chances  of  a  less  than  optimal  choice  for  the  final 
stepsize.  A  balance  must  therefore  be  struck  between  the  size  of  the  initial  interval  of 
uncertainty,  the  size  of  the  final  interval  of  uncertainty,  and  the  number  of  iterations 
to  be  performed. 

Initial  investigations  were  performed  to  determine  the  range  of  stepsize 
values  that  were  typical  for  various  neural  network  applications.  It  was  found  that 
the  stepsize  (a*)  generally  did  not  exceed  a  value  of  10.0  and  was  typically  less  than 
1.0.  An  initial  interval  of  uncertainty  of  10.0  was  therefore  used  throughout  remainder 
of  the  thesis  research. 

In  the  course  of  determining  the  initial  interval  of  uncertainty  it  was  found 
that  the  line  search  method  would  occasionally  yield  a  final  step  size  value  (a*)  which 
produced  an  error  function  value  much  greater  than  the  previous  iteration’s  value.  It 
was  determined  that  this  problem  was  a  result  of  the  error  function  surface  not  being 
unimodal  in  the  direction  (d*)  along  which  the  algorithm  searched  for  the  minimum. 
If  this  second  minimum  was  closer  to  one  of  the  two  evaluation  points  (A/.  and  /t*) 
than  the  true  minimum,  as  shown  in  figure  3.5,  then  the  algorithm  would  converge  to 
this  second  minimum.  This  would  result  in  an  error  function  value  larger  than  when 
the  line  search  algorithm  started.  To  remedy  this  problem,  the  initial  interval  of 
uncertainty  was  shifted  to  the  left  so  that  the  first  point  evaluated  was  for  A0  =  0.  If 
the  error  function  for  the  final  stepsize  value  (q^.)  was  greater  than  the  error  function 
value  with  a  stepsize  of  zero,  then  a  stepsize  of  zero  was  returned  as  the  final  stepsize 
value  (a*).  This  had  the  effect  of  resetting  the  conjugate  gradient  algorithm.  A 
stepsize  of  zero  caused  the  algorithm  to  retain  the  same  weights  and  thresholds  for 
the  next  iteration  of  the  algorithm.  As  a  result,  the  gradient  (g*+i)  at  the  next 
iteration  was  identical  to  the  previous  gradient  (g*.)  and  the  two  successive  identical 


gradients  would  produce  a  deflection  coefficient  (/?*)  equal  to  zero.  This  would  reset 
the  direction  vector  (d*)  to  the  value  of  the  present  gradient  (g*)  rather  than  the 
weighted  sum  of  previous  gradients.  This  had  the  effect  of  reinitializing  the  conjugate 
gradient  method,  but  at  a  new  starting  point  (h*)  on  the  error  function  surface. 


Figure  3.5:  Line  profile  of  the  error  function  surface 

Having  fixed  the  initial  interval  of  uncertainty,  the  number  of  iterations  of 
the  line  search  algorithm  performed  during  each  iteration  of  the  conjugate  gradient 
method  was  varied  to  determine  an  optimal  number.  Using  sixteen  iterations,  the 
conjugate  gradient  algorithm  was  able  to  consistently  reduce  the  value  of  the  error 
function.  The  value  of  the  error  function  did  not  consistently  drop  when  fewer  than 
sixteen  iterations  were  used.  Using  equation  3.9  this  resulted  in  a  final  interval  of 
uncertainty  of  0.00626. 
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C.  COMPUTER  PROGRAM  IMPLEMENTATION 


1.  Conjugate  Gradient  Algorithm 

The  conjugate  gradient  algorithm  was  implemented  for  a  multiple  input, 
single  output  neural  network  using  the  C  programming  language.  A  flow  chart  show¬ 
ing  the  basic  functions  that  are  performed  by  the  program  is  shown  in  figure  3.6.  The 
user  is  prompted  at  the  start  of  the  program  for  the  number  of  neurons  in  each  stage 
of  the  neural  network,  the  number  of  iterations  of  the  conjugate  gradient  algorithm 
that  should  be  performed,  and  the  name  of  the  input  file  that  contains  the  training 
data  that  the  algorithm  will  use  to  adapt  the  weights  and  thresholds  of  the  network. 
The  number  of  neurons  allowed  in  the  network  is  limited  to  a  total  of  50  and  the 
number  of  weights  connecting  the  neurons  is  limited  to  500.  This  maximum  number 
of  neurons  and  weights  was  more  than  large  enough  for  the  various  problems  to  which 
the  program  was  applied.  The  training  data  file  consists  of  columns  of  data  in  which 
each  column  is  associated  with  an  input  to  the  neural  network  except  for  the  the  last 
column.  The  last  column  is  the  desired  output  of  the  neural  network.  Each  row  is  a 
separate  training  data  set.  Upon  completion  of  the  program  three  files  are  produced. 
The  first  is  a  file  that  contains  the  final  results.  The  first  column  of  the  file  is  the 
desired  value  and  the  second  column  is  the  value  that  the  neural  network  produced 
using  the  final  weights  and  thresholds  of  the  network.  If  the  algorithm  has  performed 
as  expected  and  reduced  the  error  function  to  a  small  value,  then  the  two  columns  of 
data  should  be  nearly  identical.  The  second  output  file  produced  contains  the  final 
weights  and  thresholds  of  the  network.  This  file  can  then  be  used  by  any  other  pro¬ 
gram  which  simulates  the  operation  of  a  neural  network  with  the  same  configuration 
of  neurons.  The  final  file  is  produced  only  if  the  neural  network  has  two  inputs.  The 
file  consists  of  a  21  x  21  matrix  of  neural  network  output  values  that  were  produced 
by  applying  a  sequence  of  twenty-one  evenly  spaced  values  between  0.0  and  1.0  to 
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each  of  the  two  inputs.  The  resulting  file  can  be  rsed  to  produce  a  three  dimensional 
mesh  of  the  output  surface  of  the  neural  network.  Examples  of  the  input  screen, 
output  screen,  and  both  the  input  and  output  files  are  contained  in  Appendix  A.  A 
copy  of  the  C  program  source  code  is  contained  in  Appendix  B. 

2.  Backpropagation  Algorithm 

In  order  to  evaluate  the  conjugate  gradient  algorithm’s  performance,  the 
backpropagation  method  was  also  implemented.  The  basic  flow  chart  for  the  back- 
propagation  method  is  shown  in  figure  3.7.  Because  of  the  similarity  between  the  con¬ 
jugate  gradient  method  and  the  backpropagation  methods,  this  required  only  a  few 
changes  to  the  program  that  implemented  the  conjugate  gradient  algorithm.  These 
changes  consisted  of 

•  Replacing  the  stepsize  value  (a/..)  calculated  by  the  Fibonacci  line  search  with  a 
user  specified  constant  referred  to  as  the  learning  rate  by  the  backpropagation 
method. 

•  Replacing  the  deflection  coefficient  (/?*)  which  is  calculated  foi  every  iteration 
of  the  algorithm  with  a  user-specified  constant  referred  to  as  the  momentum 
factor  by  the  backpropagation  method. 

•  Updating  the  weights  and  thresholds  of  the  neural  network  after  the  application 
of  each  training  data  set  rat  her  than  upon  completion  of  a  complete  pass  through 
the  entire  training  data  file. 

The  input  and  output  files  remain  the  same  as  those  for  the  conjugate  gradient  version 
of  the  program. 

The  following  chapter  compares  the  performance  of  the  conjugate  gradient 
and  backpropagation  algorithms  and  also  presents  the  results  of  several  neural  network 
applications. 
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Figure  3.7:  Backpropagation  algorithm  flowchart 
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IV.  RESULTS 


In  this  chapter,  the  results  of  the  research  conducted  on  neural  networks  using 
the  conjugate  gradient  method  are  presented.  The  chapter  is  divided  into  two  parts. 
The  first  concerns  the  performance  of  the  conjugate  gradient  algorithm  compared 
to  that  of  the  backpropagation  method.  The  second  provides  several  examples  of 
neural  network  applications.  Where  possible,  the  performance  of  the  neural  network 
is  compared  to  its  linear  counterpart. 

A.  CONJUGATE  GRADIENT  ALGORITHM  PERFORMANCE 
1.  Performance  Measures 

The  rationale  for  implementing  the  conjugate  gradient  algorithm  was  to 
develop  an  alternative  to  the  backpropagation  method  that  would  converge  more 
quickly  to  the  optimal  set  of  weights  and  thresholds  for  a  given  problem.  The  error 
function  ( E )  is  a  measure  of  whether  the  weights  and  thresholds  of  a  neural  network 
are  optimum  when  applied  to  a  particular  problem.  The  smaller  the  error  function 
value,  the  more  nearly  optimum  the  weights  and  thresholds  are.  Both  algorithms 
reduce  the  value  of  the  error  function  by  iteratively  adapting  the  weights  and  thresh¬ 
olds  of  the  neural  network.  The  rate  at  which  the  backpropagation  and  conjugate 
gradient  algorithms  converge  to  the  optimal  set  of  weights  and  thresholds  can  be 
measured  using  several  methods.  The  simplest  approach  would  be  to  determine  the 
number  of  iterations  each  algorithm  requires  to  reduce  the  value  of  the  error  function 
to  a  prescribed  level.  The  number  of  iterations  for  each  algorithm  would  then  be 
compared  and  the  algorithm  requiring  fewer  iterations  would  be  considered  to  con¬ 
verge  more  quickly.  This  approach  does  not,  however,  take  into  account  the  greater 
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computational  complexity  of  the  conjugate  gradient  method.  A  more  accurate  mea¬ 
sure  of  performance  for  the  purposes  of  comparison  is  the  number  of  multiplications 
performed  by  each  algorithm.  This  measure  better  reflects  the  relative  computational 
requirements  of  the  two  algorithms.  The  number  of  multiplications  performed  by  each 
of  the  methods  over  one  iteration  is  fixed.  We  can  therefore  calculate  a  multiplication 
ratio  of  the  two  methods  and  then  use  this  ratio  in  conjunction  with  the  number  of 
iterations  to  compare  their  relative  performance. 

2.  Calculation  of  the  Multiplication  Ratio 

The  number  of  multiplications  performed  by  both  the  backpropagation 
method  and  the  conjugate  gradient  method  over  one  iteration  is  a  function  of  several 
variables.  These  include  the  number  of  neurons  and  weights  in  the  network,  the  size 
of  the  training  data  file  used  to  train  the  network,  and  the  number  of  iterations  per¬ 
formed  by  the  Fibonacci  line  search  method.  Tables  4.1  and  4.2  show  the  number  of 
multiplications  required  by  various  functions  of  the  conjugate  gradient  and  backprop¬ 
agation  method,  respectively.  The  tables  also  show  the  total  number  of  times  each 
function  is  performed  during  a  single  iteration  of  the  algorithm.  The  variable  T  is  the 
number  of  training  data  sets  used  to  train  the  network,  the  variable  P  is  the  number 
of  weights  and  thresholds  in  the  network,  and  R  is  the  number  of  neurons  in  the 
network.  Table  4.1  figures  reflect  that  the  step  size  (a*)  is  calculated  using  sixteen 
iterations  of  the  Fibonacci  line  search  algorithm.  The  total  number  of  multiplications 
(M)  performed  by  each  of  the  algorithms  is  therefore 

Mcg  =  T{'2QP  +  37  R  +  17)  +  21  {P  +  R)  +  35  (-1.1 ) 

for  the  conjugate  gradient  method  and 

MBp  =  T(bP  +  57?)  (4.2) 
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TABLE  4.1:  MULTIPLICATIONS  -  CONJUGATE  GRADIENT 
METHOD 


Function 

Times 

Performed 

Number  of 
multiplies 

Calculate  network  output 

1ST 

P  +  2R 

Calculate  gradient  vector 

T 

2  P  +  R 

Calculate  deflection  coefficient 

1 

•IP  +  2  R 

Update  direction  vector 

1 

P  +  R 

Calculate  step  distance 

1 

35 

Update  weight  vector 

IS 

P  +  R 

Calculate  error  sum 

17 

M 

TABLE  4.2:  MULTIPLICATIONS  -  BACKPROPAGATION  METHOD 


Function 

Times 

Performed 

Number  of 
multiplies 

Calculate  network  output 

T 

P  +  2R 

Calculate  gradient  vector 

T 

2  P  +  R 

Update  direction  vector 

T 

P  +  R 

Update  weight  vector 

T 

P  +  R 
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for  the  backpropagation  method.  We  can  then  derive  the  multiplication  ratio  by 
dividing  MCg  by  Mbp  to  obtain 


RATIO  = 


Mcg 

Mbp 


T{20P  +  Z1R  +  17)  +  21  (P  +  R)  +  35 
T{5P  +  5  R) 


Equation  4.3  can  then  be  factored  into  four  terms  as  shown  below 


(4.3) 


RATIO  ----  4  + 


17{R  +  V 
5  (P  +  R) 


21  35 

+  5T  +  5  T{P  +  R)‘ 


(4.4) 


For  the  purposes  of  approximation,  the  last  two  terms  of  equation  4.4  can  be  elim¬ 
inated  since  the  number  of  training  data  sets  used  to  train  the  neural  network  is 
typically  large.  As  the  number  of  neurons  in  a  network  is  increased,  the  number  of 
connections  or  weights  in  the  network  increases  at  a  much  greater  rate.  This  happens 
because  each  neuron  in  a  given  layer  is  connected  to  every  neuron  in  the  next  layer 
of  the  network.  As  a  result,  the  second  term  of  equation  4.3  steadily  decreases  as 
the  number  of  neurons  is  increased.  The  lower  bound  on  the  multiplication  ratio  is 
therefore  approximately  four  and  the  upper  bound  can  be  set  at  approximately  five 
for  networks  having  more  than  just  a  few  neurons. 

3.  Performance  Results 

The  performance  of  the  conjugate  gradient  method  was  compared  to  the 
performance  of  the  backpropagation  method  using  two  different  training  problems. 
The  first  consisted  of  training  the  neural  network  to  produce  a  binary  output  of  either 
one  or  zero  depending  on  the  inputs  to  the  network.  The  second  problem  involved 
training  the  neural  netwoik  to  produce  a  specific  value  within  the  range  of  zero  to 
one  for  a  given  set  of  inputs  to  the  network. 

A  plot  of  the  normalized  value  error  function  versus  the  number  of  itera¬ 
tions  performed  for  the  binary  problem  is  pictured  in  Figure  4.1  for  the  backpropa¬ 
gation  algorithm  and  in  Figure  4.2  for  the  conjugate  gradient  algorithm.  Note  the 
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difference  in  the  horizontal  scale  of  the  two  figures.  The  error  function  steadily  de¬ 
creased  for  the  conjugate  gradient  method  while  the  error  function  actually  increased 
for  approximately  the  first  100  iterations  of  the  backpropagation  algorithm.  Also 
note  that  the  error  function’s  rate  of  change  was  much  more  even  for  the  conjugate 
gradient  algorithm  than  for  the  backpropagation  method. 

In  order  to  compare  the  relative  performance  of  the  two  algorithms,  the 
multiplication  ratio’s  upper  bound  of  five  was  used.  Pictured  in  Figure  4.3  is  a 
comparison  of  the  two  algorithms’  convergence  rates  with  respect  to  the  approximate 
number  of  multiplications  performed  by  each  algorithm.  As  can  be  seen,  for  the  binary 
case,  the  conjugate  gradient  method  consistently  outperformed  the  backpropagation 
method  for  any  given  number  of  multiplications  performed. 

The  results  were  even  more  apparent  for  the  continuous  output  problem. 
The  backpropagation  method  was  unable  to  significantly  reduce  the  error  function’s 
value  for  the  first  500  iterations  of  the  algorithm  as  is  shown  in  Figure  4.4.  The  conju¬ 
gate  gradient  method,  however,  steadily  reduced  the  value  of  the  error  function  value 
after  each  iteration  of  the  algorithm  (Figure  4.5).  Comparison  of  the  convergence 
rates  of  the  two  methods  with  respect  to  the  number  of  multiplications  required  in 
each  case  is  shown  in  Figure  4.6.  For  any  given  number  of  multiplications  the  conju¬ 
gate  gradient  method  greatly  outperformed  the  backpropagation  method. 

The  conclusion  from  the  two  examples  above  is  that  the  conjugate  gradient 
method  performs  as  well  or  better  than  the  backpropagation  method  with  respect  to 
both  the  number  of  iterations  and  the  number  of  multiplications  required  to  reduce 
the  error  function  to  a  desired  level.  The  conjugate  gradient  method  therefore  satisfies 
one  goal  of  this  thesis  which  was  to  develop  an  alternative  to  the  backpropagation 
method  that  would  converge  more  quickly  to  the  optimal  set  of  weights  and  thresholds 
for  any  given  problem. 
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Figure  4.2:  Binary  problem  -  conjugate  gradient 
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Figure  4.3:  Binary  problem  -  comparison 
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Figure  4.5:  Continuous  problem  -  conjugate  gradient. 
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Figure  4.6:  Continuous  problem  -  comparison 
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B.  NEURAL  NETWORK  APPLICATION  RESULTS 


Several  simple  applications  were  chosen  to  evaluate  the  performance  of  the  con-  < 
jugate  gradient  method  vis-a-vis  the  backpropagation  method.  These  applications 
were  also  used  to  develop  a  better  understanding  of  the  potential  signal  processing 
applications  for  the  neural  network.  When  possible,  the  neural  network’s  performance 
was  compared  to  its  linear  counterpart. 

1.  A  Classification  Problem 

The  goal  for  this  problem  was  to  train  a  neural  network  to  differentiate 
between  two  classes  of  inputs.  The  two  classes  of  inputs  consisted  of  points  which 
fell  either  inside  or  outside  of  a  circle  with  a  diameter  of  0.5  centered  within  in  a 
unit  square  as  shown  in  Figure  4.7.  This  classification  problem,  although  relatively 
simple,  is  representative  of  one  of  the  primary  tasks  to  which  neural  networks  have 
been  applied — pattern  recognition  and  classification  [Ref,  l:pp.  66-67]. 

The  points  used  to  construct  the  training  data  file  were  evenly  spaced  0.1 
apart  from  zero  to  one  for  both  the  A'o  and  A'i  coordinates  as  shown  in  Figure  4.7. 
This  produced  a  total  of  121  points  over  the  unit  square.  The  training  data  file 
was  composed  of  121  data  sets,  each  set  consisting  of  the  coordinates  for  one  of  the 
training  points  and  a  value  representing  the  desired  class  to  which  the  point  belonged. 
The  desired  value  for  a  point  falling  inside  the  circle  was  a  one.  The  desired  value 
for  a  point  falling  outside  the  circle  was  a  zero.  The  conjugate  gradient  algorithm 
was  used  to  train  a  neural  network  which  had  two  inputs,  eight  first  layer  neurons, 
four  second  layer  neurons,  and  one  output  neuron  (a  2-8-4-1  configuration).  After 
100  iterations  of  the  algorithm,  the  total  squared  error  summed  over  the  entire  121 
training  data  sets  was  6.26  x  10-2.  The  resulting  output  of  the  neural  network  as  a 
function  of  its  inputs  is  pictured  in  Figure  4.8. 
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The  neural  network  produced  output  values  ranging  from  1.56  x  10“9  to 
1.0  and  was  able  to  properly  identify  the  class  to  which  each  of  the  training  data 
points  belonged.  The  contour  plot  of  the  neural  network  output  for  a  single  contour 
value  of  0.5  is  shown  in  Figure  4.9.  The  plot  clearly  shows  that  the  conjugate  gradient 
algorithm  was  able  to  calculate  a  set  of  weights  and  thresholds  for  the  neural  network 
that  very  closely  approximates  the  desired  result.  A  circular  decision  region  was 
formed  that  allowed  the  neural  network  to  differentiate  between  points  falling  inside 
the  circle  and  points  falling  outside  the  circle.  This  is  because  a  neural  network,  due 
to  its  nonlinearity,  has  the  ability  to  form  arbitrarily  complex  decision  regions. 

This  simple  example  clearly  demonstrates  the  ability  of  a  neural  network 
to  produce  a  nonlinear  mapping  of  a  set  of  analog  inputs  to  a  single  binary  output 
value.  In  this  case,  this  nonlinear  mapping  was  used  to  produce  the  two  decision 
regions  pictured  in  Figure  4.9.  For  other  applications,  the  formation  of  decision 
regions  may  not  be  called  for.  Rather,  the  output  of  the  network  may  have  to  be 
continuously  variable. 

2.  Nonlinear  Time  Series  Prediction 

The  previous  problem  required  the  neural  network  to  produce  only  a  binary 
output  of  one  or  zero.  The  second  application  was  selected  so  that  the  conjugate 
gradient  algorithm’s  performance  could  be  evaluated  for  the  case  of  a  continuously 
variable  range  of  desired  output  values.  This  type  of  application  falls  into  a  second 
class  of  tasks  for  which  the  neural  network  can  be  applied — nonlinear  mapping  of  a 
set  of  analog  inputs  to  an  analog  output  value  [Ref.  l:p.  67],  It  was  decided  to  apply 
the  neural  network  to  the  problem  of  one-step  prediction  of  a  nonlinear  time  series. 
One-step  prediction  is  a  fairly  common  application  in  digital  signal  processing.  A 
nonlinear  time  series  was  used  since  one-step  prediction  for  a  linear  time  series  could 
easily  be  satisfied  using  a  linear  filter  rather  than  a  neural  network.  The  method 
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used  to  perform  the  prediction  is  similar  to  that  used  by  a  linear  predictor.  The 
next  value  in  the  series  is  predicted  using  the  previous  values  of  the  series.  The  basic 
configuration  is  pictured  in  Figure  4.10. 


Figure  4.10:  Time  series  predictor 

For  a  linear  predictor  the  output  of  the  predictor  is  merely  a  weighted  sum 
of  a  given  number  of  previous  values  of  the  series.  The  neural  network,  however,  can 
produce  an  output  which  is  a  nonlinear  function  of  a  given  number  of  previous  values. 
The  nonlinear  time  series  used  to  train  and  evaluate  the  conjugate  gradient  algorithm 
was  produced  using 

a'n+i  "  4Bxn(l  —  xn).  (4.5) 

This  equation  is  referred  to  as  the  classic  logistic  or  Feigenbaum  map  and  has  been 
studied  quite  extensively  because  its  simplicity  and  its  application  to  chaos  theory. 
This  iterated  equation  (equation  4.5)  produces  an  ergodic,  chaotic  time  series  that  is 
bounded  and  quasi-periodic  [Ref.  12:p.  10).  A  training  sequence  of  100  samples  was 
generated  using  equation  4.5  with  the  variable  B  equal  to  1.0.  This  sequence  was 
then  used  to  adaptively  calculate  the  optimal  coefficients  for  a  linear  second  order 
prediction  filter  using  a  recursive  least  squares  method.  The  linear  predictor’s  results 
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are  pictured  in  Figure  4.11.  Only  the  first  fifty  samples  of  the  sequence  were  plotted 
so  that  the  two  curves  on  the  graph  could  be  better  differentiated.  It  is  obvious 
from  Figure  4.11  that  the  linear  predictor  was  unable  to  accurately  predict  the  next 
value  in  the  nonlinear  series  using  the  two  previous  values  of  the  series.  When  the 
difference  between  the  the  actual  and  predicted  signals  is  plotted  one  can  see  that  the 
magnitude  of  the  error  is  almost  as  great  as  the  magnitude  of  the  original  signal  (see 
Figure  4.12).  As  was  expected,  the  linear  predictor  performs  poorly  for  a  nonlinear 
problem. 

The  same  training  sequence  was  then  used  by  the  conjugate  gradient  al¬ 
gorithm  to  train  a  neural  network  with  a  2-4-2-1  configuration.  The  network  was 
trained  to  predict  the  next  value  of  the  series  based  on  the  two  previous  values.  Af¬ 
ter  100  iterations,  the  sum  of  the  squared  errors  over  the  100  training  data  sets  was 
7.25  x  10"3.  This  would  equate  to  an  average  standard  deviation  from  the  actual  sig¬ 
nal  of  approximately  8.51  x  10-3.  The  neural  network’s  results  are  pictured  in  Figure 
4.13.  It  is  apparent  that  the  neural  network  performed  much  better  than  the  linear 
predictor.  The  prediction  error  for  the  neural  network  is  pictured  in  Figure  4.14.  The 
magnitude  of  the  neural  network’s  prediction  error  is  much  smaller  than  that  for  the 
linear  predictor.  This  error  could  also  be  reduced  even  further  if  additional  iterations 
of  the  conjugate  gradient  were  performed. 

This  example  demonstrates  that  a  neural  network  is  quite  capable  of  per¬ 
forming  nonlinear  mapping  of  a  set  of  analog  inputs  to  an  analog  output.  The  neural 
network  can  also  produce  more  accurate  results  than  the  linear  approach  when  the 
problem  to  be  solved  is  nonlinear.  It  must  be  recognized,  however,  that  although 
the  neural  network  produces  more  accurate  results,  it  is  much  more  computationally 
complex  than  the  linear  approach  to  the  problem. 
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Figure  4.14:  Neural  network  prediction  error 
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3.  Channel  Equalization 

One  final  example  will  serve  to  demonstrate  the  potential  applications  for 
the  neural  network.  The  idea  of  using  a  neural  network  to  perform  channel  equaliza¬ 
tion  for  a  nonminimum  phase  transmission  channel  was  borrowed  from  Gibson,  Siu, 
and  Cowan  [Ref.  15].  The  experimental  results  indicate  that  a  neural  network  could 
potentially  provide  superior  performance  to  its  linear  counterpart  when  the  channel 
over  which  the  digital  data  is  transmitted  is  nonminimum  phase. 

a.  Transmission  Channel  Model  and  Equalizer  model 

When  digital  data  is  transmitted,  it  frequently  becomes  distorted  by 
the  channel  over  which  it  travels.  This  distortion  can  frequently  be  modeled  using 
a  linear  time  invariant  (LTI)  system  [Ref.  8:p.  426],  The  channel  model,  shown  in 
Figure  4.15,  consists  of  the  transfer  function  H(z)  and  a  channel  noise  term  The 


channel 


Figure  4.15:  Channel  model  and  equalizer 

transfer  function  of  the  channel  is  defined  by  a  finite  impulse  response  (FIR)  equation 

H(z)  —  ao  +  a^z  1  +  ' •  +  afiZ  1 .  (4.6) 

The  channel  noise  term  n,  is  typically  assumed  to  be  zero  mean,  additive  white 
gaussian  noise.  The  purpose  of  a  channel  equalizer  also  shown  in  Figure  4.15  is  to 
reverse  the  distorting  effects  of  the  channel  and  to  recover  the  original  signal  (a  ,)  using 
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m  samples  of  the  received  signal,  t/,-,yt_i, . . .  ,y,_m+1.  If  we  assume,  for  a  moment, 
that  the  noise  term  (n*)  is  zero,  then  the  received  signal  y,-  is  merely  a  weighted  sum 
of  the  present  and  past  values  of  the  original  signal  x This  can  be  expressed  as 

k 

(4.7) 

j=o 

where  ay  are  the  fc-f 1  coefficients  associated  with  the  channel  transfer  function  H(z). 
For  a  binary  signal(±l),  therefore,  the  received  signal  y,  can  assume  only  one  of  2k 
possible  values.  If  we  then  try  to  estimate  the  original  signal  xt-  using  an  m  sample 
vector  [y,-,  y,_i, . . .  ,y,_m+ 1],  we  can  only  form  a  fixed  number  of  permutations  of  the 
received  signal  vector.  Each  received  signal  vector  [y»,y,--i,... ,y;_m+i]  belongs  to 
either  the  set  of  vectors  corresponding  to  a  transmitted  binary  one  (+1 )  or  the  set  of 
vectors  corresponding  to  a  transmitted  binary  zero  (  —  1).  The  channel  equalizer  pro¬ 
duces  an  estimate  of  the  transmitted  signal  x;  by  determining  which  set  the  received 
signal  vector  belongs  to.  It  has  been  shown  that  a  linear  transversal  equalizer  can 
perform  such  an  operation  if  the  channel  transfer  function  H(z)  is  minimum  phase 
[Ref.  15:p.  1184].  If  the  channel  transfer  function  is  not  minimum  phase,  then  the  two 
received  vector  sets  are  not  linearly  separable  and  a  linear  equalizer  cannot  accurately 
estimate  £,■  based  on  the  received  data  vector  set  [y;,y,-i, . . . ,  </,_„l+i].  If  a  delay,  d. 
is  introduced  in  the  calculation  of  x,-,  such  that  the  at  the  il1'  iteration  the  equalizer 
estimates  the  original  signal  x,_d,  then  accurate  estimation  of  the  original  signal  can 
be  achieved  [Ref.  15:p.  1184].  This  value  for  d  however,  may  not  be  known,  or  may 
vary  with  time.  The  result  is  that  a  linear  transversal  equalizer,  even  with  a  delay, 
may  not  be  able  to  satisfactorily  equalize  a  nonminimum  phase  channel, 
b.  A  Nonminimum  Phase  Channel  Equalizer 

The  ability  of  a  neural  network  to  form  arbitrary  decision  regions, 
demonstrated  in  Chapter  II,  could  possibly  remedy  this  problem.  To  investigate  this 
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concept,  the  first  order  nonminimum  phase  transfer  function  {H(z)  =  0.5  +  2-1)  was 
used  to  evaluate  the  performance  of  both  a  neural  network  and  a  linear  transversal 
equalizer.  The  possible  values  for  yt-  using  this  transfer  function  are:  +1.5,  +0.5,  —0.5, 
and  —1.5.  A  two  input  neural  network  and  two  input  linear  transversal  equalizer  were 
used  since  the  channel’s  transfer  function  was  only  first  order,  and  this  allowed  a 
graphical  analysis  of  the  problem.  The  eight  possible  combinations  of  y,  and  y,_ i  are 
shown  in  Figure  4.16.  The  symbol  x  indicates  that  the  original  signal  a-,-  had  a  value 
of  —1  and  the  symbol  o  indicates  that  a,-  was  equal  to  +1.  Notice  that  the  symbols 
are  intermixed  such  that  no  single  line  can  be  drawn  that  will  completely  separate  the 
two  classes  of  symbols.  This  is  what  makes  the  nonminimum  phase  case  intractable 
for  the  linear  transversal  equalizer.  If  the  noise  term,  n,-,  is  now  incorporated  into 
the  problem,  the  result  is  as  shown  in  Figure  4.17  for  a  signal-to-noise  ratio  (SNR) 
of  10  dB.  The  number  of  possible  values  for  y;  becomes  infinite,  but  the  points  are 
distributed  about  the  original  eight  points  shown  in  Figure  4.16.  The  coefficients 
for  a  first  order  linear  transversal  equalizer  were  calculated  by  applying  a  recursive 
least  squares  (RLS)  algorithm  to  the  set  of  500  consecutive  values  of  y,  pictured  in 
Figure  4.17.  The  values  for  y,  were  generated  by  using  a  random  sequence  of  +1  and 
-1  for  a,,  applying  this  binary  sequence  to  the  transfer  function  given  above,  and 
adding  a  normally  distributed  noise  term  with  a  standard  deviation  equivalent  to  a 
signal-to-noise  ratio  of  10  dB.  The  linear  transversal  equalizer’s  two  decision  regions 
are  pictured  in  Figure  4.18.  The  region  that  is  shaded  with  dots  is  the  area  for  which 
the  linear  transversal  equalizer  produced  an  estimate  of  +1  for  x,  and  the  unshaded 
region  where  the  equalizer  produced  an  estimate  of  —1  for  Xi.  Note  that  the  best  that 
the  linear  equalizer  could  do  was  to  define  two  decision  regions  such  that  three  of  the 
four  possible  points  fell  within  the  proper  region.  The  same  500  value  data  set  was 
then  used  to  train  a  neural  network  having  a  2-6-4 -1  configuration.  The  decision 
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regions  formed  by  the  neural  network  after  100  iterations  of  the  conjugate  gradient 
algorithm  are  pictured  in  Figure  4.19.  The  neural  network,  because  of  its  ability 
to  account  for  the  nonlinearities,  was  able  to  form  two  separate  decision  regions  for 
each  of  the  two  possible  values  for  x,.  The  four  decision  regions  properly  encompass 
the  eight  possible  points  associated  with  y ;  and  yi-i.  As  a  result,  the  total  number 
of  errors  produced  over  the  500  value  training  set  dropped  from  151  for  the  linear 
equalizer  to  65  for  the  neural  network.  The  neural  network’s  ability  to  form  more 
complex  decision  regions  allowed  it  to  more  accurately  perform  equalization  when  the 
transfer  function  was  nonminimum  phase. 

c.  A  Nonminimuni  Phase  Channel  Equalizer  Using  a  Delay 
It  was  stated  earlier  that  introduction  of  a  delay  d  could  allow  the 
linear  equalizer  to  more  accurately  perform  its  equalization  function.  Pictured  in 
Figure  4.20  are  the  eight  possible  points  associated  with  y,  and  j/,_i  for  a  delay  of 
one  sample  (i.e.,  the  estimate  of  x,_j  based  on  the  samples  y,  and  y,_i).  The  two 
classes  of  points  are  no  longer  intermixed  as  they  were  for  the  case  of  no  delay.  A  set 
of  coefficients  for  the  linear  equalizer  can  therefore  be  calculated  that  will  properly 
separate  the  two  sets  of  points.  With  noise  added,  however,  the  sets  of  points  begin  to 
intermix  as  shown  in  Figure  4.21  for  a  signal-to-noise  ratio  of  10  dB.  The  separation 
of  the  two  classes  becomes  more  difficult  particularly  for  the  linear  equalizer  which 
can  only  use  a  single  line  to  define  the  decision  boundary.  The  coefficients  for  the 
linear  equalizer  were  again  calculated  using  the  RLS  algorithm  and  the  500  values 
for  y ,  pictured  in  Figure  4.21.  The  resulting  decision  regions  are  shown  in  Figure 
4.22.  Comparison  of  the  two  decision  regions  with  the  original  training  data  (Figure 
4.21)  indicates  that  the  linear  equalizer  was  unable  to  define  a  single  line  that  could 
separate  all  the  points  into  their  proper  regions.  The  linear  equalizer  produced  a 
total  of  19  errors  over  the  500  values  of  the  training  data  set.  The  same  training  data 
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Figure  4.18:  Linear  equalizer  decision  regions 
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set  was  then  used  to  train  a  neural  network  with  a  2-6-4-1  configuration  using  the 
conjugate  gradient  algorithm.  After  twenty  iterations,  the  neural  network  produced 
the  two  decision  regions  pictured  in  Figure  4.23.  The  boundary  between  the  two 
decision  regions  is  no  longer  a  straight  line  but  is  shaped  to  take  into  account  the 
distribution  of  points  caused  by  the  introduction  of  noise.  The  neural  network  only 
produced  a  total  of  3  errors  over  the  500  value  training  set. 
d.  A  Performance  Comparison 

The  results  from  the  two  above  examples  would  tend  to  indicate  that 
a  neural  network  can  produce  superior  results  to  the  linear  equalizer  both  when 
a  delay  is  introduced  and  when  a  delay  is  not  introduced.  In  order  to  confirm  this 
result,  the  performance  of  both  the  linear  transversal  equalizer  and  the  neural  network 
were  evaluated  for  various  signal-to-noise  ratios.  The  measure  of  performance  for 
the  test  was  the  average  bit  error  probability.  The  four  signal-to-noise  ratios:  5.0 
dB,  10  dB,  20  dB,  and  25  dB  were  used  to  generate  four  different  sets  of  training 
sequences.  Each  sequence  was  generated  using  a  different  signal-to-noise  ratio.  Both 
the  linear  equalizer  and  the  neural  network  were  then  trained  using  these  four  500 
value  sequences  for  y, .  After  calculating  the  coefficients  for  the  linear  equalizer  and 
the  weights  and  thresholds  for  the  neural  network  the  bit  error  performance  of  each 
type  equalizer  was  calculated  by  passing  the  same  100.000  bit  sequence  through  each 
equalizer  and  counting  the  number  of  times  the  equalizer  produced  an  error.  The 
results  for  the  case  where  no  delay  was  used  is  shown  in  Figure  4.24.  As  was  expected, 
the  bit  error  probability  for  the  linear  equalizer  with  no  delay  was  extremely  poor.  The 
bit  error  probability  for  the  neural  network  steadily  dropped  as  the  magnitude  of  the 
noise  fell.  The  lowest  of  the  three  curves  shown  in  Figure  4.24  reflects  the  performance 
of  the  neural  network  at  the  various  sigi.al-to-noise  ratios  after  having  been  trained 
using  the  10  dB  SNR  training  data  set.  Its  performance  is  equal  to  or  better  than  the 
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Figure  4.20:  Possible  combinations  of  y<  and  ?/,•_ t  (with  delay) 
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Figure  4.22:  Linear  equalizer  (with  delay)  decision  regions 


neural  networks  trained  and  evaluated  for  a  specific  SNR.  This  is  because  the  lower 
SNR  forced  the  conjugate  gradient  algorithm  to  produce  a  set  of  decision  boundaries 
that  were  more  nearly  optimal.  This  re?  .’en  more  apparent  for  the  case  when 

a  delay  was  introduced  in  the  equalization  problem  (Figure  4.25).  The  same  method 
was  used  as  described  above,  except  that  both  the  linear  equalizer  and  neural  network 
produced  an  estimate  of  x,_i,  rather  than  x,-,  based  on  the  received  signals  t/,  and 
3/;_i  .  Once  again  the  neural  network  performed  better  than  the  linear  equalizer  and 
the  neural  network  trained  using  10  dB  data  performed  the  best. 

One  final  comparison  can  be  made  between  the  neural  network  and 
the  linear  transversal  equalizer.  This  is  a  comparison  of  neural  network  without  delay 
versus  the  linear  equalizer  with  delay.  This  comparison  is  shown  in  Figure  4.26.  Also 
shown  is  the  neural  network’s  performance  with  a  delay.  The  neural  network  without 
delay  did  not  perform  as  well  as  the  linear  equalizer  for  low  signal-to-noise  ratios. 
As  the  magnitude  of  the  noise  was  reduced,  however,  the  performance  of  the  two 
approaches  became  comparable.  The  neural  network  with  delay,  however,  was  better 
than  any  of  the  approaches. 

e.  Channel  Equalizer  Conclusions 

The  performance  of  both  a  linear  transversal  equalizer  and  a  neural 
network  were  evaluated  with  respect  to  their  ability  to  accurately  equalize  a  nonmin¬ 
imum  phase  digital  data  channel.  It  was  found  that  a  linear  transversal  equalizer  was 
unable  to  accurately  estimate  the  original  signal  because  of  the  channel’s  nonmini¬ 
mum  phase  characteristic.  The  neural  network,  because  of  its  ability  to  form  arbitrary 
boundaries,  did  not  suffer  from  this  problem.  Introduction  of  a  delay  allowed  both 
the  linear  transversal  equalizer  and  the  neural  network  to  improve  their  performance. 
Finally,  a  neural  network  using  no  delay  showed  a  comparable  performance  to  a  linear 
transversal  filter  with  a  delay  for  high  signal-to-noise  ratios.  The  ability  of  the  neural 
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Figure  4.25:  Equalizer  performance  (with  delay; 
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network  to  perform  equalization  without  introduction  of  a  delay  could  prove  useful, 
particularly  if  the  required  delay  is  unknown  or  varies  with  time. 

't 


SNR  ( dB ) 


Figure  4.26:  Equalizer  performance  -  all  methods 
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V.  CONCLUSIONS  AND 
RECOMMENDATIONS 


A.  CONCLUSIONS 

The  first  objective  of  this  thesis  research  was  to  develop  an  alternative  to  the 
backpropagation  method  for  calculating  the  optimal  set  of  weights  and  thresholds  for  a 
neural  network.  The  results  presented  in  Chapter  IV  demonstrated  that  the  conjugate 
gradient  algorithm  developed  for  this  thesis  was  more  computationally  efficient  than 
the  well  know:  backpropagation  method. 

The  second  objective  of  this  research  was  to  develop  a  better  understanding  of 
the  relationship  between  the  structure  of  a  neural  network  and  its  ability  to  perform 
input-to-output  mapping.  A  graphical  approach  was  used  to  analyze  the  internal 
representations  of  the  neural  network.  The  results  of  this  analysis  were  presented  in 
Chapter  II. 

The  final  objective  of  this  thesis  research  was  to  evaluate  the  performance  of  a 
neural  network  for  several  different  signal  processing  applications.  The  first  example 
presented  demonstrated  the  ability  of  a  neural  network  to  perform  classification.  The 
second  example,  nonlinear  time  series  prediction,  compared  the  performance  of  a 
neural  network  to  its  linear  equivalent,  and  showed  that  the  neural  network  produced 
superior  results.  The  final  example  illustrated  the  performance  differences  between  a 
neural  network  and  a  linear  approach  to  nonminimum  phase  channel  equalization. 

These  applications  demonstrated  that  the  nonlinear  properties  of  a  neural  net¬ 
work  frequently  allow  the  neural  network  to  perform  functions  more  effectively  than 
its  linear  counterpart.  This  is  particularly  the  true  when  the  problem  itself  is  nonlin¬ 
ear.  It  must  be  recognized,  however,  that,  there  is  a  cost,  to  this  increased  functionality. 


Calculation  of  the  proper  weights  and  thresholds  for  a  given  problem  is  much  more 
computationally  complex.  The  computational  complexity  associated  with  the  use  of 
a  neural  network  must  therefore  be  balanced  with  the  accuracy  desired  when  decid¬ 
ing  whether  to  use  a  neural  network  rather  than  a  linear  approach  to  solve  a  given 
problem. 


B.  FUTURE  RESEARCH 

In  the  course  of  this  thesis  research,  several  other  areas  were  identified  that 
merit  additional  study. 

1.  Transfer  Function  Selection 

The  sigmoid  function  used  for  this  research  produced  an  output  that  ranged 
between  0  and  1.  Other  transfer  functions  could  be  investigated  that  produce  a  bipolar 
output.  This  could  prove  to  be  more  useful  for  typical  signal  processing  applications. 
One  such  transfer  function  that  could  be  evaluated  is  the  hyperbolic  tangent  function 


tanh(r) 


e:  —  e-: 
e*  +  e"* 


1  -  e  2- 
1  +  e2-'  ' 


(5.1) 


This  nonlinear  function  produces  a  value  which  ranges  between  ±1  and  is  continuously 
differentiable  for  all  values  of  2. 

\  Neural  Network  Dynamic  Range 

The  performance  of  a  neural  network  having  a  greater  dynamic  range  could 
be  investigated.  The  dynamic  range  of  the  neural  network  could  be  expanded  by 
allowing  adaptation  of  the  output  weight  w3.  It  could  also  be  accomplished  by  using 
a  linear  transfer  function  for  the  single  neuron  in  output  layer  of  the  network.  The 
output  of  the  network  would  then  be  a  linear  combination  of  the  weighted  outputs 
from  the  second  layer  of  network.  This  is  approach  taken  by  Lapedes  and  Farber  in 
their  research  (Ref.  llj. 
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3.  Internal  Representations 

This  thesis  made  no  attempt  to  analyze  the  internal  representations  used 
by  the  neural  network  to  produce  the  desired  outputs  for  a  given  set  of  inputs.  Re¬ 
search  could  be  conducted  to  try  to  determine  exactly  what  type  of  functions  the 
individual  neurons  in  the  network  perform.  This  could  provide  further  insight  into 
the  relationship  between  the  structure  of  a  neural  network  and  its  ability  to  perform 
a  particular  task. 

4.  Analysis  of  the  Weights  and  Thresholds 

Research  could  be  performed  to  determine  if  there  is  any  analytical  signif¬ 
icance  to  the  final  weight  and  threshold  values  for  a  neural  network. 


APPENDIX  A:  PROGRAM  OUTPUT  SCREEN 

AND  DATA  FILES 

A.  EXAMPLE  OUTPUT  SCREEN 

**  Conjugate  Gradient  Algorithm  ** 

What  is  the  name  of  the  training  data  file?  circ.dat 

How  many  inputs  to  the  neural  network?  2 

How  many  1st  layer  neurons?  4 

How  many  2nd  layer  neurons?  2 

There  will  be  only  one  3rd  layer  neuron. 

How  many  passes  thru  the  training  data  set?  2 

Initial  Error  sum:  40.2786 

Performing  iteration  number  1 
Beta  value:  0 
Alpha  value:  0.10958 
Error  sum:  17.3579 

Performing  iteration  number  2 
Beta  value:  0.00366825 
Alpha  value:  3.79148 
Error  sum:  17.3564 

Final  error  sum:  17.3557 

Where  do  you  want  the  results  stored?  circ.res 
**  Calculating  final  results  ** 

Where  do  you  want  the  final  weight/theta  values  stored?  circ.wgt 
**  Storing  final  weight/theta  values  ** 

Where  do  you  want  the  map  matrix  stored?  circ.map 
**  Calculating  map  of  network  ** 
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B.  EXAMPLE  INPUT  DATA  FILE 


4.0000e-001 

4.0000e-001 

4.0000e-001 

4.00006-001 

4.0000e-001 

4.0000e-001 

4.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

5.0000e-001 

6.0000e-001 

6.0000e-001 

6.0000e-001 

6.0000e-001 

6.0000e-001 

6.0000e-001 


Input  1 


4.0000e-001 

5.0000e-001 

6.0000e-001 

7.0000e-001 

8.00006-001 

9.0000e-001 

1.00006+000 

0.0000e+000 

1.0000e-001 

2.00006-001 

3.0000e-001 

4.0000e-001 

5.00006-001 

6.00000-001 

7.00000-001 

8.00006-001 

9.00006-001 

1.00000+000 

0.00000+000 

1.00000-001 

2.00000-001 

3.00000-001 

4.00000-001 

5.00000-001 

Input  2 


1.00006+000 
1.0000e+000 
1.00006+000 
1.0000e+000 
0.0000e+000 
0.0000e+000 
0.0000e+000 
0.0000e+000 
0.0000e+000 
0.0000e+000 
l.OOOOe+OOO 
1.0000e+000 
l.OOOOe+OOO 
l.OOOOe+OOO 
l.OOOOe+OOO 
0.0000e+000 
0.0000e+000 
0.0000e+000 
0.0000e+000 
0 . 0000e+000 
0 . 0000e+000 
l.OOOOe+OOO 
l.OOOOe+OOO 
l.OOOOe+OOO 

v  v 
Desired  output 
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C.  EXAMPLE  RESULTS  OUTPUT  DATA  FILE 


1.000000e+000 
1. OOOOOOe+OOO 
1. OOOOOOe+OOO 
1 . 000000e+000 
0 . OOOOOOe+OOO 
0 . OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
0 . OOOOOOe+OOO 
0.000000e+000 
1 . OOOOOOe+OOO 
1 . OOOOOOe+OOO 
1.000000e+000 
1. OOOOOOe+OOO 
1 . OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
0. OOOOOOe+OOO 
1. OOOOOOe+OOO 
1. OOOOOOe+OOO 
1. OOOOOOe+OOO 


Desired  output 


1.733739e-001 
1.738492e-001 
1.743229e-001 
1.747937e-001 
1.752599e-001 
1.757203e-001 
1.7617366-001 
1.714368e-001 
1.718988e-001 
1.7236596-001 
1.7283656-001 
1.7330926-001 
1.7378256-001 
1.7425486-001 
1.7472476-001 
1.7519066-001 
1.756512e-001 
1.761052e-001 
1.7138746-001 
1.7184526-001 
1.7230866-001 
1 .7277606-001 
1.7324606-001 
1.7371716-001 

Actual  output 


TO 


D.  EXAMPLE  FINAL  WEIGHTS  OUTPUT  DATA  FILE 


2  4  2  }  Number  of  neurons  in  each  layer 


1.023357 

-0.092301 

-0.061310 

-0.796225 

-0.618687 

0.513663 

-0.831033 

-0.279913 

0.262880 

-0.542745 

1.344298 

1.000000 


0.621861  -0.194039  0.288292  1  T  4  .  .  f  . 

-0.595949  0.007433  -0.795105  j  npU  weigl  s  W 

-0.031015  0.059272  0.010853  } Input  thresholds  [0O] 

-0.522201  ‘ 

-0.103397 
0.325973 
0.906917 
0.139071 

|  2nd  layer  weights  [u^] 

[  2nd  layer  threshold  [#2] 

[  Output  weight  [103] 


>  1st  layer  weights  [tui] 

•  1st  layer  thresholds  [0i] 


APPENDIX  B:  PROGRAM  SOURCE  CODE 

LISTING 


♦include  <stdio.h> 

♦include  <stdlib.h> 

# include  <math.h> 

♦include  <float.h> 

/********************************************************************/ 
/*  This  program  calculates  the  weights  and  thresholds  for  a  */ 

/+  feedforward  multilayer  neural  network  using  the  conjugate  */ 

/*  gradient  optimization  method.  */ 

/********************************************************************/ 
/********************************************************************/ 
/*  FUNCTION  DECLARATIONS  */ 

/********************************************************************/ 
int  get.info(char  filenameG  ,int  num_node[]); 
int  get_data(char  filenameG .double  ts_data[] ,int  num.inputs); 
int  init .weights (double  *weight_ptr ,int  num_node[] ) ; 
int  init .thetas (double  *theta_ptr,int  num.nodeD); 
void  adapt_network(double  weight □  .double  thetaG.int  num.nodeD, 
int  num.weights.int  num.theta, double  data.arrayG, 
int  array.size.int  max.iteration) ; 
double  f ire_neurons(double  *activity_ptr ,  double  *weight_ptr, 
double  *theta_ptr,int  num.nodeD); 
void  calc.gradient (double  activity □  .double  weight G, 
double  theta.gradient G  .double  gradientG. 
int  num.node G . int  num.weights.int  num.theta); 
double  calc_beta(double  old.gradientG .double  old.theta.gradient [] , 
double  new.gradient G .double  new.theta.gradient G , 
int  num.inputs , int  num.theta); 

void  update_direction(double  gradient G .double  direction G .double  beta, 
int  num.intputs) ; 

void  update  .weights  (double  weight  G  .double  alpha, double  directionG, 
int  num.inputs) ; 

double  calc_alpha(double  weight G .double  directionG  .double  theta [] , 
double  theta.directionG  .double  activityG. 
double  data.arrayG  , int  array.size.int  num.nodeG, 
int  num.weights.int  num.theta); 

void  load.values (double  * input _ptr, double  *output_ptr , int  total.num); 
int  fibon(int  n) ; 
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void  write.result (double  weight [] .double  thetaG  .int  num.nodeG, 
double  ts_data[] ,int  set.size); 

void  map  .network  (double  weight G .double  theta  □  ,int  num.nodeG); 
void  store.weights (double  weight G .double  *theta..ptr ,  int  num.nodeG); 

/ ******************************************************************** / 

/*  MAIN  PROGRAM  */ 

/********************************************************************/ 
main() 

{ 

chau:  filename [14] ; 

int  max. iteration, num.node [5] .num.weights .set .size, num.theta; 
double  ts_data[30003 .weight [400] ,theta[50] ; 

printf("\n  **  Conjugate  Gradient  Algorithm  **  \n"); 
max.iteration  =  get. inf o(filename, num.node) ; 
set.size  =  get_data(f ilename.ts .data, num.node [0] ) ; 
if  (set.size  ==  0){ 
exit(O) ; 

> 

num_weights=init .weights (weight .num.node) ; 
num.theta* in it .thetas (theta, num.node) ; 

adapt .network (we ight , theta , num. node , num.we ight s , num.thet  a , 
ts.data, set .size, max.iteration) ; 
write.result (weight , theta .num.node , ts.data , set.size) ; 
store.weights (weight , theta .num.node) ; 
if  (num.node [0]  ==  2){ 
map .network (weight .theta, num_node ) ; 

> 

exit(0) ; 

> 

/********* **********************************************************+/ 
/*  FUNCTION  GET. INFO  */ 

/********* *****************  ***** **>m****************** ***************/ 
int  get_info(char  filenameG  ,int  num.nodeG) 

•C 

int  max.iteration; 

printf("\n  What  is  the  name  of  the  training  data  file?  "); 

flushallO; 

gets (filename) ; 

printf("\n  How  many  inputs  to  the  neural  network?  "); 

scanf  ("y,2hd"  ,&num_node[0] ) ; 

printf("\n  How  many  1st  layer  neurons?  "); 

scanf  ("'/,2hd"  .Anum.nodeCl] ) ; 


printf("\n  How  many  2nd  layer  neurons?  "); 
scanf  ("*/,2hd"  .tnum.node [2]  )  ; 

printf("\n  There  will  be  only  one  3rd  layer  neuron.  "); 
num_node[3]  ■  1; 
num.node [4]  *  1; 

printf("\n\n  How  many  passes  thru  the  training  data  set?  "); 
scanf ("%4hd" .imax.iteration) ; 
return (max.iteration) ; 

> 

/********************************************************************/ 
/*  FUNCTION  GET. DATA  */ 

/********************************************************************/ 
int  get_data(char  filename^ .double  ts_data[]  ,int  num.inputs) 

•C 

FILE  *stream; 
int  i, num.read; 

num.read  =  0; 

if  ((stream  «  fopen(tf ilename[0] ,"r"))  !s  NULL){ 
for  (i=0;(i  <  3000) &fc 
(f  scanf  (stream,  "'/,lg",ts_data  +  i)>0);i++) 

9 

fclose(stream) ; 
if  ((i'/,(num_inputs+l))  !*  0){ 

printf("\n\n  **  Improper  number  of  input  data  elements  **"); 
num.read  =  0; 

> 

else{ 

num.read  =  i/(num_inputs+l) ; 

} 

> 

else-( 

printf("\n\n  **  Could  not  find  the  specified  file  **"); 

> 

return(num.read) ; 

> 

/******* ****************************************** ******* ************/ 
/*  FUNCTION  INIT.WEIGHTS  */ 

/**************±****t************************^t ********* *************/ 

int  init .weights (double  *weight_ptr,int  num.nodeC]) 

{ 

f define  MAX.VAL  16384.0 
int  num.weights ,i; 

srand(l) ; 
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num.weights  =  0; 
for  (i=0;i<3;i++){ 

num.weights  +=  num_node[i3*num_node[i+l3 ; 

> 

for  (i=0;i<(num_node[03*num_node[l3) ;i++){ 

♦weight_ptr++  *  (1.0  -  (rand()/MAX_VAL)) ; 

> 

for  (i=0;i<(num_node[l3*num_node[23 ) ;i++){ 

♦weight _ptr++  =  (1.0  -  (rand()/MAX_VAL)) ; 

> 

for  (i*0;i<(num_node[23*num_node[33)  ;i++){ 

♦weight_ptr++  *  (1.0  -  (rand()/MAX_VAL)) ; 

> 

♦weight _ptr  *  1.0; 
num.weights  +=  1; 
return(num_weights) ; 

> 

/********************************************************************/ 
/*  FUNCTION  INIT.THETAS  */ 

/********************************************************************/ 
int  init .thetas (double  ♦theta.ptr.int  num.nodeG) 

{ 

int  num_theta,i; 

num.theta  =  num.node [13  +num_node [23  +num_node [33 ; 
for  (i=0;i<num_theta;i++){ 

*theta_ptr++  *  0.0; 

} 

return (num.theta) ; 

> 

/********************************************************************/ 
/*  FUNCTION  ADAPT.NETWORK  */ 

/********************************************************************/ 
void  adapt  .network  (double  weight  □  .double  thetaG.int  num.nodeG, 
int  num.weights, int  num.theta, double  data.arrayG, 
int  array.size.int  max. iteration) 

{ 

int  iteration, i, j ,set_num; 

double  activity [503 .gradient [4003 .direction [4003 ,gradient_sum[400] ; 
double  actual.output .desired.output, alpha, beta, old.gradient.mag; 
double  theta.gradient [503 ,theta_sura[503 ,theta_direction[503 ; 
double  old_gradient_sum[503 ,old_theta_sum[503 .error , error sum; 
double  *array_ptr; 

for  (iteration=0;iteration<max.iteration; iteration++){ 
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for  (i=0;i<num_weights;i++H 
gradient_sum[i]  =0.0; 

> 

for  (i*0;i<num_theta; i++){ 
theta.sum[i]  *  0.0; 

> 

error sum=0 . 0 ; 
array.ptr  =  data.array; 

for  (set_num=0 ; set _num<array .size ; set_num++H 
for  (i=0;i<num.node[0]  ;i++H 
activity [i]  =  *array_ptr++; 

> 

desired.output  =  *array_ptr++; 

actual _output*f ir e.neurons (activity , weight , theta , num.node) ; 
error  =  actual .output  -  desired.output; 
error  *=  error; 

gradient  [num_weights-l]  *  (actual .output  -  desired.output)* 
actual .output ; 

calc_gradient(activity .weight .theta.gradient .gradient .num.node , 
num.weights.num.theta) ; 

for  (i=0;i<(num.weights-l) ;i++){ 
gradient_sum[i]  +=  gradient [i] ; 

> 

for  (i=0;i<num_theta;i++K 
theta_sum[i]  +=  theta.gradient [i] ; 

> 

errorsum  +=  error; 

> 

printf("  Error  sum:  ‘/,lg  \n" .errorsum) ; 
if  (iteration  ==  0){ 
beta  =  0.0; 

> 

else{ 

beta  =  calc_beta(old_gradient_sum,old_theta_sum,gradient_sum, 
theta.sum,  (num_weights-l) .num.theta) ; 

> 

for  (j=0; j<(num.weights-l) ; j++){ 

old_gradient_sum[j]  =  gradient_sum[j] ; 

> 

for  (j=0; j<num.theta; j++){ 

old.theta.sumCj]  =  theta.sumCj] ; 

> 

printf("\n  Performing  iteration  number  */,d  \n" ,  (iteration+1)) ; 
printfC  Beta  value:  '/,lg  \n",beta); 

update_direction(gradient_sum, dir 3Ction, beta, (num.weights-l) ) ; 
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update_direction(theta_sum,theta_direction, beta, num.theta) ; 
alpha*calc_alpha(weight .direction , theta, theta.direct ion , activity , 
data.array , array .size , num.node .num.weights , 
num.theta) ; 

printfC  Alpha  value:  V,lg  \n", alpha); 
update.weights (weight, alpha, direction, (num.weights- 1)) ; 
update.weights (theta, alpha, theta.direction, num.theta) ; 

> 

errorsum  *  0.0; 
array _ptr  *  data.array; 

for  (set _nua«0; set _nua<array .size; set. num++K 
for  (i*0;i<num.node[0] ;i++){ 
activityCi]  *  ♦array ,ptr++; 

> 

desired.output  *  ♦array _ptr++; 

actual_output*fire_neurons (activity, weight, theta, num.node) ; 
error  *  actual.output  -  desired.output; 
error  ♦■  error; 
errorsum  +*  error; 

> 

printf("\n  Final  error  sum:  */,lg  \n"  .errorsum) ; 
return; 

> 

/***************************#*♦***♦♦♦♦**♦♦♦♦**♦♦♦*♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦*♦♦♦/ 

/♦  FUNCTION  FIRE.NEURQNS  ♦/ 

/*****************************************♦♦****************«*  *******/ 
double  fire_neurons(double  *activity_ptr .double  *weight_ptr, 
double  ♦theta.ptr , int  num.node []) 

{ 

int  layer.num.neuron.num, j ; 
double  mp , * input _ptr , *output_ptr ; 

input _ptr  =  activity.ptr; 

output.ptr  =  activity.ptr  +  num_node[0]; 

/♦  Feed  input  forward  thru  each  layer  of  the  network  ♦/ 
for  (layer_num=0; layer. num<3; layer _num++H 
for  (neuron_numa0;neuron_nura  <  num.node [layer _num+ 1] ;neuron_num++){ 
temp  -  0.0; 

for  (j*0;j  <  num.node [layer _num] ;j++){ 
temp  -=  (*weight.ptr++)*(input_ptr[j]) ; 

} 

temp  ♦*  ♦theta_ptr++; 

♦output _ptr++  =  1 .0/(1 .0+exp(temp)) ; 

> 
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input _ptr  ♦*  num.node [layer.num] ; 

> 

temp  *  (*input.ptr)  *  (♦weight.ptr) ; 
return(temp) ; 

> 

/******♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦**♦♦♦*♦/ 

/♦  FUNCTION  CALC .GRADIENT  ♦/ 

/*****♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦♦/ 
void  calc .gradient (double  activity  □  , double  weight [] , 
double  theta.gradient [] , double  gradient [] , 
int  num.node [], int  num_weights,int  num.theta) 

{ 

int  layer.num.i, j .offset; 

double  ♦weight _ptr,*gradient_ptr, *result . gradient _ptr; 
double  ♦output  _acty  _ptr , * input .act y .pt r , t  emp , *thet  a.ptr ; 

weight _ptr  *  iweight [num_weights-l] ; 
gradient.ptr  *  *gradient[num_weights-l] ; 
result .gradient _ptr  *  gradient.ptr  -  1; 

output .acty.ptr  =  Inactivity [0]  +  (num.node [0]+num.node[l]+num.node [2] ) ; 
input.accy.ptr  *  output.acty.ptr  -  1; 
theta.ptr  =  fttheta.gradient [num.theta- 1] ; 

for  (layer.num  *  2 ; layer .num>-l; layer.num — ){ 
for  (j=0;j<num.node [layer.num  +  l];j++){ 
temp  =0.0; 
offset  =  0; 

for  (i=0;i<num.node [layer _num+2]  ;i++X 
temp  +=  (*weight_ptr)  *  (*gradient_ptr) ; 
weight _ptr  -=  num.node [layer _num+l] ; 
gradient.ptr  -=  num.node [lay er_num+ 1] ; 

> 

offset  =  (num.node [layer_num+2] ♦num.node [layer_num+l] )-l ; 

weight _ptr  +=  offset; 

gradient.ptr  +=  offset; 

temp  ♦=  (1.0  -  (♦output, acty.ptr — )); 

for  (i=0;i<num.node [layer.num] ;i++){ 

(♦result.gradient.ptr--)  =  temp  *  (* input .acty.ptr — ); 

} 

♦theta.ptr —  =  (-temp) ; 
input.acty.ptr  +=  num.node  [layer.num] ; 

} 

input.acty.ptr  -=  num.node [layer.num] ; 

> 

return ; 
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> 

/************************************************************* *******/ 
/*  FUNCTION  UPDATE.WEIGHTS  */ 

/********************************************************************/ 
void  update_weights(double  weight [] .double  alpha,  double  direction [] , 
int  num. inputs) 

{ 

int  i; 

for  (i«*0;i<num.inputs;i++){ 
weight [i]  ♦*  alpha*direction[i] ; 

> 

return; 

> 

/********************************************************************/ 
/*  FUNCTION  CALC.BETA  */ 

/*******************************%************************************/ 
double  calc.beta(double  old.gradient  □ .double  old.theta.gradientG  , 
double  new.gradient G .double  new.theta.gradient [] , 
int  num. input s, int  num.theta) 

{ 

int  i; 

double  beta, tempi, temp2; 

tempi  ■  0.0; 
temp2  =  0.0; 

for  (i=0;i<num.inputs;i++H 

tempi  +*  ( (new.gradient [i] -old.gradient [i] )*new_gradient [i] ) ; 
temp2  +=  (old.gradient [i]  *  old.gradient [i] ) ; 

> 

for  (i=0;i<num_theta;i++H 

tempi  +*  ( (new.theta.gradient [i] -old_theta_gradient[i] )* 
new.theta.gradient [i] ) ; 

temp2  +*  (old_theta_gradient[i]  *  old.theta_gradientCi]) ; 

> 

beta  ■  templ/temp2; 
if  (beta  <  0.0){ 
beta  *  0.0; 

> 

return (beta) ; 

> 

/+♦***♦*** *******N ************************ ***************************/ 

/*  FUNCTION  UPDATE.DIRECTION  */ 

/********************************************************************/ 
void  update.direction (double  gradient[],  double  directionG, 
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double  beta,  int  num_ inputs) 


{ 

int  i; 

for  (i=0;i<num.inputs;i++){ 
direction [i]  *=  beta; 
direction [i]  -=  gradient  [i] ; 

> 

return; 

> 

/i m*******^*********************************************************/ 

/*  FUNCTION  CALC.ALPHA  */ 

/t*****^**^*^*****************************************************/ 

double  calc_alpha(double  weight [3 .double  direction [] .double  theta[], 
double  theta_direction[]  .double  activity!!], 
double  data_array[] ,int  array_size,int  num.nodeD, 
int  num.weights, int  num.thata) 

{ 

double  a ,b , lamda ,mu , lamda.result .mu.result , desired.result , epsilon ; 
double  actual .result, test .weight [500] ,test_theta[50] ,*array_ptr; 
int  i, k, set _num, max .steps; 

a  =  0.0; 
b  =  10.0; 
max.steps  =  16; 
epsilon  =  0.001; 

lamda  *  a+((b-a)*f ibon(max_steps-2)/f ibon (max.steps)) ; 
mu  =  a+((b-a)*f ibon(max_steps-l)/f ibon (max.steps)) ; 
a  -=  lamda; 
b  -=  lamda; 
mu  -=  lamda; 
lamda  =0.0; 

load.values (weight, test .weight .num.weights) ; 
load.values (theta, test_theta,num_theta) ; 

update.weights (test .weight , lamda , direction , (num.weights- 1 ) ) ; 
update.weights (test.theta, lamda .theta.direct ion, num.theta) ; 
lamda.result  =0.0; 
array .ptr  =  data.array; 

for  (set _num=0; set _num<array .size; set _num++){ 
for  (i=0;i<num_node[0] ;i++){ 
activity [i]  =  *array_ptr++; 

> 

desired.result  =  *array_ptr++; 

actual_result=fire_neurons(activity .test.weight ,test_theta, 
num.node) ; 
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actual.result  --  desired.result; 
actual.result  *=  actual.result ; 
lamda.result  +=  actual.result; 

> 

load.values (weight, test .weight .num.weights) ; 

load.values (theta , test .theta , num.theta) ; 

update. weights (test.weight, mu, direction, (num_weights-l)) ; 

update.weights (test .theta ,mu , thet a.direction , num.theta) ; 

mu.result  *  0.0; 

array _ptr  =  data.array; 

for  ( set _num=0 ; s  et  _num<arr ay  _s ize ; s  et  _num++ ) { 
for  (i=0;i<num_node[0] ;i++){ 
activity [i]  =  *array_ptr++; 

> 

desired.result  =  *array_ptr++; 

actual_result=fire_neurons (activity .test.weight .test .theta, 
num.node) ; 

actual.result  -=  desired.result; 
actual.result  *=  actual.result; 
mu.result  +=  actual.result; 

> 

for  (k=l; (k< (max .steps- l))ft&(b>0.0) ;k++){ 
if  (lamda.result  >  mu.result ){ 
a  =  lamda; 
lamda  =  mu; 

lamda.result  =  mu.result; 
mu  =  ((b-a)/fibon(max_steps-k)) ; 
mu  *=  f ibon(max.steps-k-l) ; 
mu  +=  a; 

load.values (weight .test .weight, num.weights) ; 
load.values (theta, test .theta .num.theta) ; 
update.weights (test.weight , mu, direct ion, (num.weights- 1)) ; 
update.weights (test.theta, mu, theta.direction, num.theta) ; 
mu.result  =  0.0; 
array _ptr  =  data.array; 

for  (set _num=0;set.num<array .size; set _num++){ 
for  (i=s0;i<num.node[0]  ;i++){ 
activity  [i]  =  *array_ptr++; 

> 

desired.result  =  ♦array _ptr++; 

actual_result=fire_neurons(activity, test.weight .test.theta, 
num.node) ; 

actual.result  -=  desired.result; 
actual.result  *=  actual.result; 
mu.result  +=  actual.result; 
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> 


else{ 
b  =  mu; 
mu  =  lamda; 

mu.result  =  lamda.result ; 
lamda  =  ((b-a)/fibon(max_steps-k)) ; 
lamda  *=  fibon(max_steps-k-2) ; 
lamda  +=  a; 

load.values (weight ,test_weight .num.weights) ; 
load.values (theta, test.theta, num.theta) ; 

update.weights (test .weight, lamda, direction, (num.weights- 1)) ; 
update.weights (test .theta, lamda, theta_direction,num_theta) ; 
lamda.result  *  0.0; 
array _ptr  =  data.array; 

for  (set _num=0; set _num<array .size; set_num++){ 
for  (i=0;i<num_node[0] ;i++){ 
activity [i]  *  *array_ptr++; 

> 

desired.result  =  *array_ptr++; 

actual _result=f ire_neurons(activity,test_weight, test .theta, 
num.node) ; 

actual.result  --  desired.result; 
actual.result  *=  actual.result; 
mda.result  +=  actual.result; 

> 


if  ibPv 

mu  -  lamda  +  epsilon; 

load.values (weight .test.weight , num.weights) ; 
load.values (theta, test .theta, num.theta) ; 
update.weights (test.waight , mu, direction, (num_weights-l)) ; 
update.weights (test .theta, mu, theta.direct ion, num.theta) ; 
mu.result  =  0.0; 
array _ptr  =  data.array; 

for  (set _num=0;set_num< array .size ; set_num++H 
for  (i=0;i<num.node[0]  ;i++M 
activity [i]  =  *array_ptr++; 

> 

desired.result  =  *array_ptr++ ; 

actual_result*f ire.neurons (activity .test.weight .test .theta, 
num.node) ; 

actual.result  -=  desired.result; 
actual.result  *=  actual.result; 
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mu.result  +=  actual .result ; 

> 

if  (lamda.result  >  mu.resultH 
if  ((lamda+b)>  O.OH 
return((lait<da+b)/2.0) ; 

> 

elset 

return(O.O); 

> 

> 

else{ 

if  ((lamda+a)>  O.O)  { 
return((lamda+a)/2.0) ; 

> 

else{ 
return(O.O) ; 

> 

> 

> 

else{ 

return(O.O) ; 

> 

> 

/********************************************************************/ 
/*  FUNCTION  LOAD .VALUES  */ 

/********************** *************** ************** **** ** *  *  * * ** *  * *> */ 
void  load.values (double  *input_ptr .double  *output_ptr ,int  total.num) 

{ 

int  i; 

for  (i=0; i<total_num; i++){ 

♦output _ptr++  =  *input_ptr++; 

> 

return ; 

> 

/********************************************************************/ 
/*  FUNCTION  FIBON  */ 

/********************************************************************/ 
int  fibon(int  n) 

{ 

int  f0,fl,f2,k; 

f 2=f l=f 0=1 ; 
if  (n  <  2){ 
return(l) ; 
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> 

for  (k*l;k<n;k++)< 
fO  =  fl  +  f2; 
f2  *  fl; 
fl  =  fO; 

> 

return CfC) ; 

> 

/********************************************************************/ 
/*  FUNCTION  HRITE.RESULT  */ 

/********************************************************************/ 
void  write.result (double  weight [] .double  theta[],int  num_node[], 
double  ts.dataG ,int  set.size) 

{ 

FILE  *fileptr; 
chaor  f  name  [14]  ; 
int  i.set.num; 

double  desired.result, result .activity [50] ,*array_ptr; 

printf ("\n\n  Where  do  you  want  the  results  stored?  "); 
flushallQ ; 
gets (ftf name [0] ) ; 

printf ("\n  **  Calculating  final  results  **  \n"); 
fileptr  *  f open (&f name [0] ,"w") ; 
array _ptr  =  ts.data; 

for  (set_num=0 ; set _num<set .size; set _num++){ 
for  (i=0;i<num.node[0]  ;i+0{ 
activity [i]  =  *array_ptr++; 

> 

desired.result  =  *array_ptr++; 

result  =  fire.neurons (activity .weight, theta, num.node) ; 
fprintf  (fileptr,"  '/,e  '/,e  \n"  ,desired_result  .result) ; 

> 

fclose(fileptr) ; 
return ; 

> 

/********************************************************************/ 
/*  FUNCTION  MAP .NETWORK  */ 

/********************************************************************/ 
void  map.network (double  weight [] .double  theta[],int  num_node[]) 

{ 

int  row, col; 

double  result , input  1 , input2 , activity [50] ; 

FILE  *fileptr; 
chai  f name [13]; 


90 


printf("\n\n  Where  do  you  want  the  map  matrix  stored?  "); 
flushallO ; 
gets (if name [0] ) ; 

printf("\n  **  Calculating  map  of  network  **\n"); 
fileptr  *  f open (4f name [0] ,"w") ; 
input l-input2=0 . 0 ; 
for  (row=0;row<21;row++H 
for  (col=0;col<21;col++M 
activity [0] = input 1 ; 
activity  Cl] *input2 ; 

result*f ire .neurons ( activity , weight , theta .num.node ) ; 
fprintf  (fileptr,"  ‘/,e"  .result) ; 
input 1  +=  0.05; 

> 

fprintf (fileptr, "\n") ; 
inputl  =  0.0; 
input2  +=  0.05; 

> 

fclose(f ileptr) ; 
return ; 

> 

/t****<M** ******************************************** ft***********.***/ 

/*  FUNCTION  STORE.WEIGHTS  */ 

/***************************************************** ***************/ 
void  store.weights (double  weight [] .double  *theta_ptr  int  num_node[]) 

int  i.j.k; 

double  *wcight_ptrl , *weight_ptr2 ; 
char  fname[13]; 

FILE  *f ileptr; 

printf("\n\n  Where  do  you  want  the  final  weight/theta  values  stored?  "); 

flushallO; 

gets (*f name [0] ) ; 

printf("\n  **  Storing  final  weight/theta  values  **\n”); 
fileptr  =  f open (ftf name [0] ,"w") ; 
for  (i=0;i<3;i++){ 
fprintf  (fileptr ,  "'/,4d"  .num.node  [i] ) ; 

> 

fprintf (fileptr ,"\n") ; 
weight_ptr2  =  weight; 
for  (i=0;i<3;i++){ 
weight _ptrl  =  weight _ptr2; 
for  (j=0;  j <num_node [i]  ;  j ++){ 
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weight _ptrl  *  weight _ptr2  +  j; 
for  (k*0;k<num_node[i+l] ;k++){ 
fprintf (fileptr,"'/,10.61f  ",  *weight_ptrl) ; 
weight.ptri  +*  num, node[i] ; 

> 

fprintf (fileptr,"  \n"); 

> 

weight_ptr2  +s  (num_node[i]*nuin_node[i+l] ) ; 
for  (j*0; j<num_node[i+l] ; j++){ 

fprintf  (fileptr 10 .61f  ",  *theta_ptr++) ; 

> 

fprintf (fileptr,"  \n"); 

> 

fprintf  (fileptr, "'/, 10. 61f  \n"  ,*weight_ptr2) ; 
f close(f ileptr) ; 
return ; 

> 
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